{% load static %}
COURSE: DTSC 691
TERM: SU2 2023
NAME: Melissa Buz
The overall project goal is finding the frequency of UFO sightings based on year, city, and country. I will address the following questions in this project:
1. What is the frequency of UAP sightings based on year, city, and country?
2. How well do the machine learning models that are built able to predict the frequency of UAP sightings?

I have been curious about UFO/UAP sightings for a while. However, up until 2021, I relegated these events into the science fiction realm. Several cases, however, were puzzling, e.g., the USS Nimitz case (2004)1, the Westhall school case in Australia (1966)2, the Ariel School case in Zimbabwe (1994)3, and more cases occurring in schools4 with up to 60-200 witnesses. These UAP sightings deserve closer examination now with the recent government release of video footage and reports of UAPs (see Supplementary Media below). The congressional testimony of whistleblowers in June-July 2023,5, 6 further drives my interest in the topic.
Special note: The acronym UAP (unidentified aerial phenomena) is now used by the US military and considered more current than the UFO (unidentified flying object) acronym. I use both terms interchangeably in this project.
The UFO sightings dataset is called nuforc_reports.csv, which are comprised of reports collected by the National UFO Research Center (NUFORC) at https://nuforc.org/. The public can enter reports at will through the website. The NUFORC is a "non-profit Washington state" organization founded in 1974 and has been collecting reports for "the past five decades."7 It has "processed over 170,000 UFO reports."8 Its mission is to "record, and to the greatest degree possible, corroborate and document reports from individuals who have been witness to unusual, possibly UFO-related events."9 What distinguishes NUFORC from other UFO organizations is that it makes its reports available to the public.10 According to MentalFloss: "The NUFORC doesn’t question the veracity of the reports, and it doesn’t always speculate about possible causes, either."11 Yet the NUFORC does "encourage a healthy skepticism" on reading the reports on its site.12
To explore the insights that the UFO dataset may reveal, I formulated project goals and supplementary questions, see Project Goals above. I will be building a timeseries model and regression models to predict the frequency of UAP sightings. Please note that this project's goals are not to lend greater credibility to, or to debunk, these UAP reports. I will make comments and notes within the different sections of the notebook, but the full analyses treatment is given in the accompanying Word file, DTSC691ProjectAnalyses.docx. There is also a web file, DTSC691ProjectAnalyses.html, that is a presentation-like document. In both files, I chose the most illuminating tables, charts, and graphs to show my findings.
Find the data file at the data.world site here: https://data.world/timothyrenner/ufo-sightings. The Github link to access files: https://github.com/timothyrenner/nuforc_sightings_data. This data file was "assembled by scraping the reports from the NUFORC site (done with scrapy), then merged with a city location database to perform the geocoding."13 Further, the csv file has its state and city fields "cleaned", and the date, latitude, and longitude fields have been "standardized."14
For more information on this dataset, see
The dataset of UAP reports has a total of 141261 observations/rows of data and 13 columns, specifically, summary, country, city, state, date_time, shape, duration, stats, report_link, text, posted, city_latitude, city_longitude. 36097 records/rows have missing values in one or more columns. There are further problems of removing hoax reports, if feasible.
The features contained in columns that I will be using are city, state, country, date_time, shape, duration, city_latitude, and city_longitude. The other column features such as summary, stats, report_link, text, posted are longer string data that, while providing details about the UFO events, are not feasible as training data. In other words, summary and text data are too complicated and exhaustive for the scope of this project.
Dunn, Matthew. "The Westall ‘UFO’ incident still remains a mystery 50 years after it occurred." News.Com.Au, 16 Jan. 2016. https://www.news.com.au/technology/science/space/the-westall-ufo-incident-still-remains-a-mystery-50-years-after-it-occurred/news-story/f002a9da51358af2e9aca4b3d5cd81de Accessed 7 Jul. 2023.
Quinn, Shannon. "10 Alleged UFO Sightings Witnessed By Students At School." ListVerse, 16 Feb. 2018, listverse.com/2018/02/16/10-alleged-ufo-sightings-witnessed-by-students-at-school/. Accessed 27 Jul. 2023.
Harvey, Austin. "The Ariel School Phenomenon, When Dozens Of Schoolchildren In Zimbabwe Claimed They Met Aliens." ATI, 11 May 2023, allthatsinteresting.com/ariel-school-phenomenon#:~:text=In%20September%201994%2C%20a%20strange%20story%20emerged%20from,came%20to%20be%20known%20as%20the%20Ariel%20Phenomenon. Accessed 28 Jul. 2023.
Gabbatt, Adam. "US Urged to Reveal UFO Evidence After Claim That It Has Intact Alien Vehicles." The Guardian, 6 Jun. 2023, www.theguardian.com/world/2023/jun/06/whistleblower-ufo-alien-tech-spacecraft. Accessed 11 Jul. 2023.
import numpy as np
import pandas as pd
import warnings # For ignoring warnings
import re # For fixing values in country and date_time columns
import us # For finding whether a state value is a US state name
import math # For fixing values in duration column
import matplotlib.pyplot as plt # For plotting graphs
import folium # For mapping
import pytz # For determining epoch time
import datetime as dtime # For timestamping datetime objects
from datetime import datetime # For getting current date and retrieving date time values
import plotly.express as px # For heatmap
import geocoder # For mapping with latitude and longitude
from geopy.geocoders import GoogleV3 # For getting city latitude and longitude
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics
from prophet.plot import plot_cross_validation_metric
import itertools # For more cross-val of Prophet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance # feature selection
from sklearn.feature_selection import r_regression, SelectKBest # feature selection
from sklearn.feature_selection import f_classif, chi2, f_regression # feature selection
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import KFold, cross_validate, GridSearchCV # model fine-tuning
warnings.filterwarnings("ignore") # ignore all warnings
# read the data file into ufo_df dataframe
ufo_df = pd.read_csv("nuforc_reports.csv")
ufo_df.head(7)
| summary | country | city | state | date_time | shape | duration | stats | report_link | text | posted | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MADAR Node 100 | USA | Mountlake Terrace | WA | 2019-06-23T18:53:00 | NaN | NaN | Occurred : 6/23/2019 18:53 (Entered as : 06/2... | http://www.nuforc.org/webreports/reports/147/S... | MADAR Node 100 | 2019-06-27T00:00:00 | 47.794100 | -122.306600 |
| 1 | Steady flashing object with three lights hover... | USA | Hamden | CT | 2019-06-23T20:00:00 | light | 5 hours | Occurred : 6/23/2019 20:00 (Entered as : 6/23... | http://www.nuforc.org/webreports/reports/147/S... | Steady flashing object with three lights hover... | 2019-06-27T00:00:00 | 41.373941 | -72.921325 |
| 2 | Group of several orange lights, seemingly circ... | USA | Charlottesville | VA | 2019-06-20T23:28:00 | circle | 15 seconds | Occurred : 6/20/2019 23:28 (Entered as : 06/2... | http://www.nuforc.org/webreports/reports/146/S... | Group of several orange lights, seemingly circ... | 2019-06-27T00:00:00 | 38.055968 | -78.494482 |
| 3 | Dropped in flashed a few times and shot off 5 ... | USA | Lincoln Park | MI | 2019-06-21T00:00:00 | light | 2 minutes | Occurred : 6/21/2019 00:00 (Entered as : 06/2... | http://www.nuforc.org/webreports/reports/146/S... | Dropped in flashed a few times and shot off 5 ... | 2019-06-27T00:00:00 | 42.238500 | -83.178300 |
| 4 | Location: While traveling in a TGV, from Lille... | France | Douai (France) | NaN | 2019-06-07T20:00:00 | cigar | 5 minutes | Occurred : 6/7/2019 20:00 (Entered as : 06/07... | http://www.nuforc.org/webreports/reports/146/S... | Location: While traveling in a TGV, from Lill... | 2019-06-27T00:00:00 | NaN | NaN |
| 5 | Llike a star at first glance, got brighter and... | USA | San jacinto | CA | 2019-07-06T00:30:00 | sphere | 25 seconds | Occurred : 7/6/2019 00:30 (Entered as : 7/6/1... | http://www.nuforc.org/webreports/reports/147/S... | Looked like a star at first glance, got bright... | 2019-07-12T00:00:00 | 33.794093 | -116.949989 |
| 6 | Light in the sky moving from south to north wi... | USA | Otis Orchards | WA | 2019-07-06T02:00:00 | light | 45 minutes | Occurred : 7/6/2019 02:00 (Entered as : 07/6/... | http://www.nuforc.org/webreports/reports/147/S... | Light in the sky moving from south to north wi... | 2019-07-12T00:00:00 | 47.695900 | -117.107800 |
# dataset has 13 features or columns with many missing values; there are two columns, stats and report_link, with 141261 non-null values
ufo_df.info
<bound method DataFrame.info of summary country \
0 MADAR Node 100 USA
1 Steady flashing object with three lights hover... USA
2 Group of several orange lights, seemingly circ... USA
3 Dropped in flashed a few times and shot off 5 ... USA
4 Location: While traveling in a TGV, from Lille... France
... ... ...
141256 See craft every evening when I get out of work... USA
141257 MADAR Node 70 USA
141258 A very small white light hovering above the cl... USA
141259 I was young. You know what? It was pretty fuck... USA
141260 While driving at night, I watched two blue-gre... USA
city state date_time shape duration \
0 Mountlake Terrace WA 2019-06-23T18:53:00 NaN NaN
1 Hamden CT 2019-06-23T20:00:00 light 5 hours
2 Charlottesville VA 2019-06-20T23:28:00 circle 15 seconds
3 Lincoln Park MI 2019-06-21T00:00:00 light 2 minutes
4 Douai (France) NaN 2019-06-07T20:00:00 cigar 5 minutes
... ... ... ... ... ...
141256 Branford CT 2021-12-19T23:00:00 circle NaN
141257 St Louis MO 2022-03-09T00:48:00 NaN NaN
141258 Batavia IL 2013-04-30T02:19:00 circle 45 seconds
141259 Liberty Lake WA 1995-07-21T09:00:00 circle 1/12
141260 Fredericksburg VA 2019-12-24T18:30:00 sphere 4 seconds
stats \
0 Occurred : 6/23/2019 18:53 (Entered as : 06/2...
1 Occurred : 6/23/2019 20:00 (Entered as : 6/23...
2 Occurred : 6/20/2019 23:28 (Entered as : 06/2...
3 Occurred : 6/21/2019 00:00 (Entered as : 06/2...
4 Occurred : 6/7/2019 20:00 (Entered as : 06/07...
... ...
141256 Occurred : 12/19/2021 23:00 (Entered as : 12/...
141257 Occurred : 3/9/2022 00:48 (Entered as : 03/09...
141258 Occurred : 4/30/2013 02:19 (Entered as : 04/3...
141259 Occurred : 7/21/1995 09:00 (Entered as : 07/2...
141260 Occurred : 12/24/2019 18:30 (Entered as : 012...
report_link \
0 http://www.nuforc.org/webreports/reports/147/S...
1 http://www.nuforc.org/webreports/reports/147/S...
2 http://www.nuforc.org/webreports/reports/146/S...
3 http://www.nuforc.org/webreports/reports/146/S...
4 http://www.nuforc.org/webreports/reports/146/S...
... ...
141256 http://www.nuforc.org/webreports/reports/165/S...
141257 http://www.nuforc.org/webreports/reports/167/S...
141258 http://www.nuforc.org/webreports/reports/097/S...
141259 http://www.nuforc.org/webreports/reports/151/S...
141260 http://www.nuforc.org/webreports/reports/146/S...
text \
0 MADAR Node 100
1 Steady flashing object with three lights hover...
2 Group of several orange lights, seemingly circ...
3 Dropped in flashed a few times and shot off 5 ...
4 Location: While traveling in a TGV, from Lill...
... ...
141256 See craft every evening when I get out of work...
141257 MADAR Node 70 \n \n
141258 A very small white light hovering above the cl...
141259 I was young. You know what? It was pretty ((...
141260 While driving at night, I watched two blue-gre...
posted city_latitude city_longitude
0 2019-06-27T00:00:00 47.794100 -122.306600
1 2019-06-27T00:00:00 41.373941 -72.921325
2 2019-06-27T00:00:00 38.055968 -78.494482
3 2019-06-27T00:00:00 42.238500 -83.178300
4 2019-06-27T00:00:00 NaN NaN
... ... ... ...
141256 2021-12-19T00:00:00 41.277600 -72.811500
141257 2022-04-22T00:00:00 38.626989 -90.307029
141258 2013-05-15T00:00:00 41.847800 -88.307500
141259 2019-12-19T00:00:00 47.651700 -117.083800
141260 2019-05-14T00:00:00 38.299511 -77.500674
[141261 rows x 13 columns]>
# dataset has only two columns with numeric (float) data
ufo_df.dtypes
summary object country object city object state object date_time object shape object duration object stats object report_link object text object posted object city_latitude float64 city_longitude float64 dtype: object
ufo_df.describe()
| city_latitude | city_longitude | |
|---|---|---|
| count | 115440.000000 | 115440.000000 |
| mean | 38.697147 | -95.102130 |
| std | 5.736872 | 18.123691 |
| min | -31.941922 | -176.627400 |
| 25% | 34.223597 | -112.975200 |
| 50% | 39.273300 | -89.081250 |
| 75% | 42.352576 | -80.332976 |
| max | 70.639300 | 130.879729 |
# store descriptive stats of city latitude and longitude for analysis later in Supplemental Questions section
stats_lat_long = ufo_df.describe()
city_lat_max = round(ufo_df["city_latitude"].max())
city_long_max = round(ufo_df["city_longitude"].max())
city_lat_min = round(ufo_df["city_latitude"].min())
city_long_min = round(ufo_df["city_longitude"].min())
# plot the graph of ufo_df, which only show two columns' values: city_latitude and city_longitude
ufo_df.plot()
plt.show()
# checking if 2023 reports present
ufo_df[ufo_df["date_time"].str.contains("2023", na=False)]
| summary | country | city | state | date_time | shape | duration | stats | report_link | text | posted | city_latitude | city_longitude |
|---|
ufo_df.isnull().any(axis=1).sum()
ufo_df[ufo_df.isnull().any(axis=1)]
| summary | country | city | state | date_time | shape | duration | stats | report_link | text | posted | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MADAR Node 100 | USA | Mountlake Terrace | WA | 2019-06-23T18:53:00 | NaN | NaN | Occurred : 6/23/2019 18:53 (Entered as : 06/2... | http://www.nuforc.org/webreports/reports/147/S... | MADAR Node 100 | 2019-06-27T00:00:00 | 47.794100 | -122.306600 |
| 4 | Location: While traveling in a TGV, from Lille... | France | Douai (France) | NaN | 2019-06-07T20:00:00 | cigar | 5 minutes | Occurred : 6/7/2019 20:00 (Entered as : 06/07... | http://www.nuforc.org/webreports/reports/146/S... | Location: While traveling in a TGV, from Lill... | 2019-06-27T00:00:00 | NaN | NaN |
| 9 | MADAR Node 128 | USA | Helena | MT | 2019-07-01T14:48:00 | NaN | NaN | Occurred : 7/1/2019 14:48 (Entered as : 07/01... | http://www.nuforc.org/webreports/reports/147/S... | MADAR Node 128 | 2019-07-12T00:00:00 | 46.627429 | -112.012735 |
| 12 | While leaving a bar/restaurant situated on the... | USA | South Kingstown | WA | 2019-07-15T01:15:00 | unknown | 25 seconds | Occurred : 7/15/2019 01:15 (Entered as : 07/1... | http://www.nuforc.org/webreports/reports/147/S... | While leaving a bar/restaurant situated on the... | 2019-07-25T00:00:00 | NaN | NaN |
| 16 | ((HOAX??)) On Google earth, found underwater ... | USA | NaN | CA | NaN | NaN | NaN | Occurred : Reported: 7/24/2019 1:45:33 PM 13... | http://www.nuforc.org/webreports/reports/147/S... | I was on Google earth and found a underwater b... | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 141252 | Saw bright moving objects in mid day sun, clea... | Mexico | Cabo San Lucas | NaN | 2021-12-15T11:30:00 | light | NaN | Occurred : 12/15/2021 11:30 (Entered as : 12/... | http://www.nuforc.org/webreports/reports/165/S... | Saw bright moving objects in mid day sun, clea... | 2021-12-19T00:00:00 | NaN | NaN |
| 141254 | Two lights spinning around each other . See vi... | Ireland | Cappagh | COUNTY WATERFORD | 2021-12-15T17:30:00 | light | >30 minutes | Occurred : 12/15/2021 17:30 (Entered as : 12/... | http://www.nuforc.org/webreports/reports/165/S... | Two lights spinning around each other . See vi... | 2021-12-19T00:00:00 | NaN | NaN |
| 141255 | I have video and digital pictures | USA | Bella Vista | NaN | 2021-12-15T19:00:00 | formation | 5 minutes | Occurred : 12/15/2021 19:00 (Entered as : 12/... | http://www.nuforc.org/webreports/reports/165/S... | I have video and digital pictures \n \n45000 f... | 2021-12-19T00:00:00 | NaN | NaN |
| 141256 | See craft every evening when I get out of work... | USA | Branford | CT | 2021-12-19T23:00:00 | circle | NaN | Occurred : 12/19/2021 23:00 (Entered as : 12/... | http://www.nuforc.org/webreports/reports/165/S... | See craft every evening when I get out of work... | 2021-12-19T00:00:00 | 41.277600 | -72.811500 |
| 141257 | MADAR Node 70 | USA | St Louis | MO | 2022-03-09T00:48:00 | NaN | NaN | Occurred : 3/9/2022 00:48 (Entered as : 03/09... | http://www.nuforc.org/webreports/reports/167/S... | MADAR Node 70 \n \n | 2022-04-22T00:00:00 | 38.626989 | -90.307029 |
36097 rows × 13 columns
# the total number of NaN values
ufo_df.isnull().sum().sum()
81358
# the columns that have NaN values: 11 columns out of 13
# two columns have with values in all rows (no null values): stats and report_link
ufo_df.isnull().any()
summary True country True city True state True date_time True shape True duration True stats False report_link False text True posted True city_latitude True city_longitude True dtype: bool
# the total rows of data with missing values in city column: 478
ufo_df[ufo_df["city"].isnull()].shape[0]
478
# display all columns with NaN values: 11 columns out of 13
ufo_df[ufo_df.columns[ufo_df.isna().any()]]
| summary | country | city | state | date_time | shape | duration | text | posted | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MADAR Node 100 | USA | Mountlake Terrace | WA | 2019-06-23T18:53:00 | NaN | NaN | MADAR Node 100 | 2019-06-27T00:00:00 | 47.794100 | -122.306600 |
| 1 | Steady flashing object with three lights hover... | USA | Hamden | CT | 2019-06-23T20:00:00 | light | 5 hours | Steady flashing object with three lights hover... | 2019-06-27T00:00:00 | 41.373941 | -72.921325 |
| 2 | Group of several orange lights, seemingly circ... | USA | Charlottesville | VA | 2019-06-20T23:28:00 | circle | 15 seconds | Group of several orange lights, seemingly circ... | 2019-06-27T00:00:00 | 38.055968 | -78.494482 |
| 3 | Dropped in flashed a few times and shot off 5 ... | USA | Lincoln Park | MI | 2019-06-21T00:00:00 | light | 2 minutes | Dropped in flashed a few times and shot off 5 ... | 2019-06-27T00:00:00 | 42.238500 | -83.178300 |
| 4 | Location: While traveling in a TGV, from Lille... | France | Douai (France) | NaN | 2019-06-07T20:00:00 | cigar | 5 minutes | Location: While traveling in a TGV, from Lill... | 2019-06-27T00:00:00 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 141256 | See craft every evening when I get out of work... | USA | Branford | CT | 2021-12-19T23:00:00 | circle | NaN | See craft every evening when I get out of work... | 2021-12-19T00:00:00 | 41.277600 | -72.811500 |
| 141257 | MADAR Node 70 | USA | St Louis | MO | 2022-03-09T00:48:00 | NaN | NaN | MADAR Node 70 \n \n | 2022-04-22T00:00:00 | 38.626989 | -90.307029 |
| 141258 | A very small white light hovering above the cl... | USA | Batavia | IL | 2013-04-30T02:19:00 | circle | 45 seconds | A very small white light hovering above the cl... | 2013-05-15T00:00:00 | 41.847800 | -88.307500 |
| 141259 | I was young. You know what? It was pretty fuck... | USA | Liberty Lake | WA | 1995-07-21T09:00:00 | circle | 1/12 | I was young. You know what? It was pretty ((... | 2019-12-19T00:00:00 | 47.651700 | -117.083800 |
| 141260 | While driving at night, I watched two blue-gre... | USA | Fredericksburg | VA | 2019-12-24T18:30:00 | sphere | 4 seconds | While driving at night, I watched two blue-gre... | 2019-05-14T00:00:00 | 38.299511 | -77.500674 |
141261 rows × 11 columns
# drop extraneous columns
ufo_df = ufo_df.drop(columns=["report_link","posted"])
ufo_df
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MADAR Node 100 | USA | Mountlake Terrace | WA | 2019-06-23T18:53:00 | NaN | NaN | Occurred : 6/23/2019 18:53 (Entered as : 06/2... | MADAR Node 100 | 47.794100 | -122.306600 |
| 1 | Steady flashing object with three lights hover... | USA | Hamden | CT | 2019-06-23T20:00:00 | light | 5 hours | Occurred : 6/23/2019 20:00 (Entered as : 6/23... | Steady flashing object with three lights hover... | 41.373941 | -72.921325 |
| 2 | Group of several orange lights, seemingly circ... | USA | Charlottesville | VA | 2019-06-20T23:28:00 | circle | 15 seconds | Occurred : 6/20/2019 23:28 (Entered as : 06/2... | Group of several orange lights, seemingly circ... | 38.055968 | -78.494482 |
| 3 | Dropped in flashed a few times and shot off 5 ... | USA | Lincoln Park | MI | 2019-06-21T00:00:00 | light | 2 minutes | Occurred : 6/21/2019 00:00 (Entered as : 06/2... | Dropped in flashed a few times and shot off 5 ... | 42.238500 | -83.178300 |
| 4 | Location: While traveling in a TGV, from Lille... | France | Douai (France) | NaN | 2019-06-07T20:00:00 | cigar | 5 minutes | Occurred : 6/7/2019 20:00 (Entered as : 06/07... | Location: While traveling in a TGV, from Lill... | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 141256 | See craft every evening when I get out of work... | USA | Branford | CT | 2021-12-19T23:00:00 | circle | NaN | Occurred : 12/19/2021 23:00 (Entered as : 12/... | See craft every evening when I get out of work... | 41.277600 | -72.811500 |
| 141257 | MADAR Node 70 | USA | St Louis | MO | 2022-03-09T00:48:00 | NaN | NaN | Occurred : 3/9/2022 00:48 (Entered as : 03/09... | MADAR Node 70 \n \n | 38.626989 | -90.307029 |
| 141258 | A very small white light hovering above the cl... | USA | Batavia | IL | 2013-04-30T02:19:00 | circle | 45 seconds | Occurred : 4/30/2013 02:19 (Entered as : 04/3... | A very small white light hovering above the cl... | 41.847800 | -88.307500 |
| 141259 | I was young. You know what? It was pretty fuck... | USA | Liberty Lake | WA | 1995-07-21T09:00:00 | circle | 1/12 | Occurred : 7/21/1995 09:00 (Entered as : 07/2... | I was young. You know what? It was pretty ((... | 47.651700 | -117.083800 |
| 141260 | While driving at night, I watched two blue-gre... | USA | Fredericksburg | VA | 2019-12-24T18:30:00 | sphere | 4 seconds | Occurred : 12/24/2019 18:30 (Entered as : 012... | While driving at night, I watched two blue-gre... | 38.299511 | -77.500674 |
141261 rows × 11 columns
Rationale: Location data needed/required to be used in my modeling. If there is no location data entered for the row, then it renders that row not viable for project's purposes.
# make a mask with the missing values for all three location columns
mask = ufo_df[["city", "state", "country"]].isnull().all(axis=1)
# count the number of missing values in the mask
total = mask.sum()
print("Total of", total, "rows of missing values in all three columns of city, state, and country.")
Total of 146 rows of missing values in all three columns of city, state, and country.
# drop the rows with missing values in all three columns: removal of 146 rows
ufo_df = ufo_df.dropna(subset=["city", "state", "country"], how="all")
# check to see if these rows have been deleted
mask = ufo_df[["city", "state", "country"]].isnull().all(axis=1)
# get the entire row of data for those rows
ufo_df[mask]
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude |
|---|
ufo_df.shape[0]
141115
# displays number of rows with "," character in the city column
ufo_df[ufo_df["city"].str.contains(",", na=False)].shape[0]
829
# fix_city repairs city column for specific errors
def fix_city(df):
# remove the comma character and all text after it (errors in rows 789, 869, etc; total of 829 rows with this error)
df["city"] = df["city"].str.split(",").str[0]
# removes '(" or ")" characters and any text in them
df.loc[:, "city"] = df["city"].str.replace(r"\(.*\)", "", regex=True)
# remove spaces before and after the city values
df["city"] = df["city"].str.strip()
# remove ? character (error in row 805)
df["city"] = df["city"].str.strip("?")
# change to title case
df["city"] = df["city"].str.title()
fix_city(ufo_df)
lat_long_df = ufo_df[["city", "city_latitude", "city_longitude"]]
lat_long_df
| city | city_latitude | city_longitude | |
|---|---|---|---|
| 0 | Mountlake Terrace | 47.794100 | -122.306600 |
| 1 | Hamden | 41.373941 | -72.921325 |
| 2 | Charlottesville | 38.055968 | -78.494482 |
| 3 | Lincoln Park | 42.238500 | -83.178300 |
| 4 | Douai | NaN | NaN |
| ... | ... | ... | ... |
| 141256 | Branford | 41.277600 | -72.811500 |
| 141257 | St Louis | 38.626989 | -90.307029 |
| 141258 | Batavia | 41.847800 | -88.307500 |
| 141259 | Liberty Lake | 47.651700 | -117.083800 |
| 141260 | Fredericksburg | 38.299511 | -77.500674 |
141115 rows × 3 columns
# fix_state() takes a dataframe as an argument, has embedded helper function that applies
# corrections to the state and country columns (for errors in row 136716, 137237, 137247, 137458 and on)
def fix_state(df):
def fix_state_helper(row):
# Get the state and country values from the row
state = str(row["state"])
country = row["country"]
# Check if the state has 4 or more characters
if len(str(state)) >= 4:
# Check if the state is Quebec and the country is Canada
if state == "QUEBEC" and country == "Canada":
# Reassign it to QC
return "QC"
# Check if the state is Ontario and the country is Canada
elif state == "ONTARIO" and country == "Canada":
# Reassign it to ON
return "ON"
# Check if the state is Ontario and the country is USA
elif state == "ONTARIO" and country == "USA":
# Reassign it to the title case of the state (not all caps)
return state.title()
else:
# Return the original state
return state
else:
# Return the original state
return state
# Apply the check_state_helper function to the dataframe and assign the result back to the state column
df["state"] = df.apply(fix_state_helper, axis=1)
# Call the fix_state function
fix_state(ufo_df)
# fix_title_case() changes string s to title case
def fix_title_case(string):
string = str(string)
# returns a string in which each word's first letter has been changed to uppercase
return re.sub(r"[A-Za-z]+('[A-Za-z]+)?", lambda word: word.group(0).capitalize(), string)
ufo_df["country"] = ufo_df["country"].apply(fix_title_case)
# check_country() takes a dataframe as an argument and repairs state and/or country columns for errors/non-standardized formatting
# NOTE: check_country takes about 2 -3 minutes to run
def check_country(df):
def check_country_helper(row):
# get the state, country, and city values from the row
state = row["state"]
country = row["country"]
city = row["city"]
# check if the country is Colorado Springs (error in row 131202)
if country == "Colorado Springs":
# Reassign country to USA
return pd.Series({"state": state, "country": "USA"})
# check if the country is Ireland (for 171+ rows)
elif country == "Ireland" or country == "Dublin Ireland" or country == "Northern Ireland" or country == "Republic Of Ireland":
# reassign country to United Kingdom and state to Ireland
return pd.Series({"state": "Ireland", "country": "United Kingdom"})
# check if the country is Scotland (for 4 rows)
elif country == "North Wales" or country == "Uk/Wales":
# reassign country to United Kingdom and state to Scotland
return pd.Series({"state": "Wales", "country": "United Kingdom"})
# check if the country is Scotland (for 4 rows)
elif country == "Scotland":
# reassign country to United Kingdom and state to Scotland
return pd.Series({"state": "Scotland", "country": "United Kingdom"})
elif country == "United Kingdom":
if (city == "Essex"):
# assign state to Essex so map will display correctly
return pd.Series({"state": "Essex", "country": country})
elif country == "Unknown":
# check if this city is in a US state so map will display correctly
if (city == "Essex") and ((state == "CA")|(state == "CT")|(state == "MD")|(state == "MT")|(state == "NC")|(state == "VT")):
# assign country to "USA"
return pd.Series({"state": state, "country": "USA"})
# check if the country is "China Also?"" (1 row)
elif country == "China Also?":
if (city == "New Orleans"):
# reassign country to USA
return pd.Series({"state": state, "country": "USA"})
# check if the country is Russia (Former Ukraine) (1 row)
elif country == "Russia (Former Ukraine)":
if (city == "Mariupol"):
# reassign country to Ukraine
return pd.Series({"state": state, "country": "Ukraine"})
# check if the country is Yes (4 errors)
elif country == "Yes":
if (city == "Belfast"):
# reassign country to United Kingdom and state to Ireland
return pd.Series({"state": "Ireland", "country": "United Kingdom"})
elif (city == "Cebu"):
# reassign country to Philippines
return pd.Series({"state": state, "country": "Philippines"})
else:
# reassign country to Unknown
return pd.Series({"state": state, "country": "Unknown"})
else:
# return the original state and country values
return pd.Series({"state": state, "country": country})
# apply the check_country_helper function to the dataframe and assign the result back to the state and country columns
df[["state", "country"]] = df.apply(check_country_helper, axis=1)
check_country(ufo_df)
# fix_country repairs country column for specific errors seen in the dataframe
def fix_country(df):
# remove spaces before and after the country values (error in row 1057)
df["country"] = df["country"].str.strip()
# replace "Bahamas/Usa" with "Bahamas" in the country column (1 error)
df["country"] = df["country"].replace("Bahamas/Usa", "Bahamas")
# replace "Caribbean" with "Caribbean Sea" in the country column (1 error)
df["country"] = df["country"].replace("Caribbean", "Caribbean Sea")
# replace "Grand Cayman" with "Grand Cayman Island" in the country column (3 errors)
df["country"] = df["country"].replace("Grand Cayman", "Grand Cayman Island")
# replace "Guatamala" with "Guatemala" in the country column (1 error)
df["country"] = df["country"].replace("Guatamala", "Guatemala")
# replace "Gulf Of Mexico / Florida Straits" with "Gulf Of Mexico" in the country column (1 error)
df["country"] = df["country"].replace("Gulf Of Mexico / Florida Straits", "Gulf Of Mexico")
# replace "Guyana, South America" with "Guyana" in the country column (1 error)
df["country"] = df["country"].replace("Guyana, South America", "Guyana")
# replace "Hong Kong S.A.R." with "Hong Kong" in the country column (1 error)
df["country"] = df["country"].replace("Hong Kong S.A.R.", "Hong Kong")
# replace "Hong Kong Sar" with "Hong Kong" in the country column (1 error)
df["country"] = df["country"].replace("Hong Kong Sar", "Hong Kong")
# replace "India (South- East)" with "India" in the country column (1 error)
df["country"] = df["country"].replace("India (South- East)", "India")
# replace "In Orbit" with "In Orbit (In Space)" in the country column (13 errors)
df["country"] = df["country"].replace("In Orbit", "In Orbit (In Space)")
# replace "In Orbit In Space" with "In Orbit (In Space)" in the country column (1 error)
df["country"] = df["country"].replace("In Orbit In Space", "In Orbit (In Space)")
# replace "In Orbit (Space)" with "In Orbit (In Space)" in the country column (1 error)
df["country"] = df["country"].replace("In Orbit (Space)", "In Orbit (In Space)")
# replace "Isle Of Man, Great Britain" with "Isle Of Man" in the country column (1 error)
df["country"] = df["country"].replace("Isle Of Man, Great Britain", "Isle Of Man")
# replace "Kazakstan" with "Kazakhstan" in the country column (1 error)
df["country"] = df["country"].replace("Kazakstan", "Kazakhstan")
# replace "Kenya (East Africa)" with "Kenya" in the country column (1 error)
df["country"] = df["country"].replace("Kenya (East Africa)", "Kenya")
# replace "Korea (South)" with "South Korea" in the country column (1 error)
df["country"] = df["country"].replace("Korea (South)", "South Korea")
# replace "Korea South" with "South Korea" in the country column (1 error)
df["country"] = df["country"].replace("Korea South", "South Korea")
# replace "Republic Of Korea" with "South Korea" in the country column (1 error)
df["country"] = df["country"].replace("Republic Of Korea", "South Korea")
# replace "Kosova" with "Kosovo" in the country column (1 error)
df["country"] = df["country"].replace("Kosova", "Kosovo")
# replace "Lebanon (Middle East)" with "Lebanon" in the country column (1 error)
df["country"] = df["country"].replace("Lebanon (Middle East)", "Lebanon")
# replace "Luxembourg" with "Luxemburg" (German-preferred spelling) in the country column (1 error)
df["country"] = df["country"].replace("Luxembourg", "Luxemburg")
# replace "Mediterrainian Sea" with "Mediterranean Sea" in the country column (1 error)
df["country"] = df["country"].replace("Mediterrainian Sea", "Mediterranean Sea")
# replace "Mediterranean Sea." with "Mediterranean Sea" in the country column (1 error)
df["country"] = df["country"].replace("Mediterranean Sea.", "Mediterranean Sea")
# replace "Myanmar (Burma)" with "Myanmar" (1 error)
df["country"] = df["country"].replace("Myanmar (Burma)", "Myanmar")
# replace "Netherlands The" with "Netherlands" in the country column (1 error)
df["country"] = df["country"].replace("Netherlands The", "Netherlands")
# replace "Netherland Antilles" with "Netherlands Antilles" in the country column (1 error)
df["country"] = df["country"].replace("Netherland Antilles", "Netherlands Antilles")
# replace "Nigeria 🇳🇬" with "Nigeria" in the country column (1 error)
df["country"] = df["country"].replace("Nigeria 🇳🇬", "Nigeria")
# replace "Japan (Okinawa)" with "Okinawa" in the country column (1 error)
df["country"] = df["country"].replace("Japan (Okinawa)", "Okinawa")
# replace "Okinawa Japan" with "Okinawa" in the country column (1 error)
df["country"] = df["country"].replace("Okinawa Japan", "Okinawa")
# replace "Panama Canal Zone" with "Panama Canal" in the country column (1 error)
df["country"] = df["country"].replace("Panama Canal Zone", "Panama Canal")
# replace "Perú" with "Peru" in the country column (1 error)
df["country"] = df["country"].replace("Perú", "Peru")
# replace "Puerto Rico Commonwealth" with "Puerto Rico" in the country column (1 error)
df["country"] = df["country"].replace("Puerto Rico Commonwealth", "Puerto Rico")
# replace "Puerto Rico Usa" with "Puerto Rico" in the country column (1 error)
df["country"] = df["country"].replace("Puerto Rico Usa", "Puerto Rico")
# replace "Russia (Ussr)" with "Russia" in the country column (1 error)
df["country"] = df["country"].replace("Russia (Ussr)", "Russia")
# replace "Russian Federation" with "Russia" in the country column (1 error)
df["country"] = df["country"].replace("Russian Federation", "Russia")
# replace "Slovak Republic" with "Slovakia" in the country column (1 error)
df["country"] = df["country"].replace("Slovak Republic", "Slovakia")
# replace "Republic Of South Africa" with "South Africa" in the country column (1 error)
df["country"] = df["country"].replace("Republic Of South Africa", "South Africa")
# replace "South Africa" with "South Africa" in the country column (1 error)
df["country"] = df["country"].replace("South Africa", "South Africa")
# replace "South-Africa" with "South Africa" in the country column (1 error)
df["country"] = df["country"].replace("South-Africa", "South Africa")
# replace "Srilanka" with "Sri Lanka" in the country column (1 error)
df["country"] = df["country"].replace("Srilanka", "Sri Lanka")
# replace "Srui Lanka" with "Sri Lanka" in the country column (1 error)
df["country"] = df["country"].replace("Srui Lanka", "Sri Lanka")
# replace "Surinam" with "Suriname" in the country column (1 error)
df["country"] = df["country"].replace("Surinam", "Suriname")
# replace "Swizterland" with "Switzerland" in the country column (1 error)
df["country"] = df["country"].replace("Swizterland", "Switzerland")
# replace "Tenerife, Spain" with "Tenerife" in the country column (1 error)
df["country"] = df["country"].replace("Tenerife, Spain", "Tenerife")
# replace "Trinidad And Tobago" with "Trinidad & Tobago" in the country column (1 error)
df["country"] = df["country"].replace("Trinidad And Tobago", "Trinidad & Tobago")
# replace "Trinidad/Tobago" with "Trinidad & Tobago" in the country column (1 error)
df["country"] = df["country"].replace("Trinidad/Tobago", "Trinidad & Tobago")
# replace "Caicos Islands" with "Turks & Caicos Islands" in the country column (1 error)
df["country"] = df["country"].replace("Caicos Islands", "Turks & Caicos Islands")
# replace "Turks & Caicos" with "Turks & Caicos Islands" in the country column (1 error)
df["country"] = df["country"].replace("Turks & Caicos", "Turks & Caicos Islands")
# replace "Turks And Caicos Islands" with "Turks & Caicos Islands" in the country column (3 errors)
df["country"] = df["country"].replace("Turks And Caicos Islands", "Turks & Caicos Islands")
# replace "TüRkiye" with "Turkey" in the country column (1 error)
df["country"] = df["country"].replace("TüRkiye", "Turkey")
# replace "United Arab Emirates" with "UAE" in the country column (29 error)
df["country"] = df["country"].replace("United Arab Emirates", "UAE")
# replace "United Arad Emirates" with "UAE" in the country column (1 error)
df["country"] = df["country"].replace("United Arad Emirates", "UAE")
# replace "Uae" with "UAE" in the country column (1 error)
df["country"] = df["country"].replace("Uae", "UAE")
# replace "Us And Canada Border" with "USA & Canada Border" in the country column (1 error)
df["country"] = df["country"].replace("Us And Canada Border", "USA & Canada Border")
# replace "Usa & Canada" with "USA & Canada Border" in the country column (1 error)
df["country"] = df["country"].replace("Usa & Canada", "USA & Canada Border")
# replace "Usa/Canadian Waters" with "USA & Canada Border" in the country column (1 error)
df["country"] = df["country"].replace("Usa/Canadian Waters", "USA & Canada Waters")
# replace "United States" with "USA" in the country column (error in row 137081)
df["country"] = df["country"].replace("United States", "USA")
# replace "Untied States Of America" with "USA" in the country column (error in row 137081)
df["country"] = df["country"].replace("Untied States Of America", "USA")
# replace "U.S.A." with "USA" in the country column (1 error)
df["country"] = df["country"].replace("U.S.A.", "USA")
# replace "usa" with "Usa" in the country column (errors in rows 95200, 132558)
df["country"] = df["country"].replace("Usa", "USA")
# replace "Usav" with "Usausa" in the country column (1 error)
df["country"] = df["country"].replace("Usav", "USA")
# replace "usa" with "Usausa" in the country column (1 error)
df["country"] = df["country"].replace("Usausa", "USA")
# replace "Great Britain" with "United Kingdom" (error in row 66854)
df["country"] = df["country"].replace("Great Britain", "United Kingdom")
# replace "Uk/England" with "United Kingdom" (errors in rows 138737, 139079, 139448)
df["country"] = df["country"].replace("Uk/England", "United Kingdom")
# replace several designations of Virgin Islands with "Virgin Islands (USA) or (UK)"
df["country"] = df["country"].replace("Us Virgin Islands", "Virgin Islands (USA)")
df["country"] = df["country"].replace("U. S. Virgin Islands", "Virgin Islands (USA)")
df["country"] = df["country"].replace("U.S. Virgin Islands", "Virgin Islands (USA)")
df["country"] = df["country"].replace("Virgin Islands (Us)","Virgin Islands (USA)")
df["country"] = df["country"].replace("Virgin Islands (U. S.)","Virgin Islands (USA)")
df["country"] = df["country"].replace("Virgin Islands (U.S.)","Virgin Islands (USA)")
df["country"] = df["country"].replace("Usvi","Virgin Islands (USA)")
df["country"] = df["country"].replace("Ukvi","Virgin Islands (UK)")
df["country"] = df["country"].replace("UK Virgin Islands", "Virgin Islands (UK)")
df["country"] = df["country"].replace("British Virgin Islands", "Virgin Islands (UK)")
# replace "Viet Nam" with "Vietnam" (25 errors)
df["country"] = df["country"].replace("Viet Nam", "Vietnam")
# replace "Nan" with "Unknown" (1 error)
df["country"] = df["country"].replace("Nan", "Unknown")
# replace "No" with "Unknown" (1 error)
df["country"] = df["country"].replace("No", "Unknown")
# replace "Non Applicable" with "Unknown" (1 error)
df["country"] = df["country"].replace("Non Applicable", "Unknown")
# replace "None" with "Unknown" (1 error)
df["country"] = df["country"].replace("None", "Unknown")
# replace "Yup" with "Unknown" (1 error)
df["country"] = df["country"].replace("Yup", "Unknown")
# replace city = "United States" to None (error in row 41194)
if (df["city"] == "United States").any() and (df["country"] == "USA").any():
df["city"] = df["city"].replace("United States", "")
fix_country(ufo_df)
# Remove later for testing
#ufo_df.loc[ufo_df["city"] == "Essex"]
# check_us_state() takes a dataframe as an argument, checks whether country == "U" AND
# state is a US state, and if so, assigns country = "USA"
def check_us_state(df):
# make a copy the original dataframe so we don't alter actual df's other columns other than the country column
df_copy = df.copy()
# check the dataframe for rows with country == "U"
tmp_df = df_copy[df_copy["country"] == "U"]
# define a function that returns True if the string is a US state, False otherwise
def is_us_state(s):
return bool(us.states.lookup(s))
# apply the function to the state column and store the result in a new column called 'is_state'
tmp_df["is_state"] = tmp_df["state"].apply(is_us_state)
# assign "USA" to the country column where is_state is True
tmp_df.loc[tmp_df["is_state"], 'country'] = "USA"
# drop the is_state column
tmp_df = tmp_df.drop("is_state", axis=1)
# update the original dataframe with the filtered dataframe
df_copy.update(tmp_df)
# return the updated dataframe
return df_copy
ufo_df = check_us_state(ufo_df)
# print_list prints the counter series argument in a list form starting with 1
def print_list(statement, colname, colname2, counter):
print(statement, "\n ", colname, "\t\t ", colname2)
i = 1
while i <= len(counter):
row = counter.iloc[i-1]
# checks whether i > 9 and the length of colname > 19 chars to display with less tab space
if (i > 9) and (len(row[colname]) > 19):
print(f"{i}. {row[colname]}\t{row[colname2]}")
else:
print(f"{i}. {row[colname]}\t\t{row[colname2]}")
i += 1
# print_med_list prints the counter series argument in a medium list form starting with 1
def print_med_list(statement, colname, colname2, counter):
print(statement, "\n ", colname, " ", colname2)
i = 1
while i <= len(counter):
row = counter.iloc[i-1]
if i < 10:
print(f"{i}. {row[colname]}\t\t{row[colname2]}")
else:
print(f"{i}. {row[colname]}\t{row[colname2]}")
i += 1
# print_sm_list prints the counter series argument in a small list form starting with 1
def print_sm_list(statement, colname, colname2, counter):
print(statement, "\n ", colname, " ", colname2)
i = 1
while i <= len(counter):
row = counter.iloc[i-1]
print(f"{i}. {row[colname]}\t\t{row[colname2]}")
i += 1
# print_list_no_num prints the counter series argument in a list form without numbers
def print_list_no_num(statement, colname, colname2, counter):
print(statement, colname, " ", colname2)
i = 1
while i <= len(counter):
row = counter.iloc[i-1]
print(f" {row[colname]}\t{row[colname2]}")
i += 1
# print_longlist_no_num prints the counter series argument in a list form without numbers
def print_longlist_no_num(statement, colname, colname2, counter):
print(statement, colname, "\t ", colname2)
i = 1
while i <= len(counter):
row = counter.iloc[i-1]
print(f" {row[colname]}\t{row[colname2]}")
i += 1
# Supplemental Question: Compare Years with the Counts of UFO Sightings (WITHOUT hoax filter)
### For Analysis of Supplemental Questions secton below ###
# extract the year from the date_time column
compare_df = ufo_df.copy() # Need to make copy of ufo_df of possibility of altering the working dataframe
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
compare_df2 = ufo_df.copy()
compare_df2["year"] = pd.DatetimeIndex(compare_df2["date_time"]).year
# create a list of years for comparison
yr_list = [1762, 1780, 1800, 1810, 1820, 1850, 1870, 1880, 1900, 1910, 1920, 1925, 1930, 1935, 1940, 1945, 1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
recent_yr_list = [1960,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022]
# filter the dataframe by the given years
compare_df = compare_df.loc[compare_df["year"].isin(yr_list)]
compare_df2 = compare_df2.loc[compare_df2["year"].isin(recent_yr_list)]
# group by year and count sightings
yrs_comp_no_fil = compare_df.groupby("year").size()
recent_yrs_comp_no_fil = compare_df2.groupby("year").size()
# reset index to count
yrs_comp_no_fil = yrs_comp_no_fil.reset_index(name="count")
recent_yrs_comp_no_fil = recent_yrs_comp_no_fil.reset_index(name="count")
#print("Year and Counts of UFO Sightings (no hoax filter)\n", yrs_comp_no_fil.head(7).astype(int).to_string(index=False))
#print("\nRecent Year and Counts of UFO Sightings (no hoax filter)\n", recent_yrs_comp_no_fil.tail(7).astype(int).to_string(index=False))
# Supplemental Question: Top 10 Locations with Highest Counts of UFO Sightings WITHOUT hoax filter
# Get the top 10 locations with UFO sightings
compare_df = ufo_df.copy()
# group by location and count sightings
sightings_no_fil = compare_df.groupby(["city"]).size()
sightings_st_no_fil = compare_df.groupby(["state"]).size()
sightings_ctry_no_fil = compare_df.groupby(["country"]).size()
# sort by count in descending order
sightings_no_fil = sightings_no_fil.sort_values(ascending=False)
sightings_st_no_fil = sightings_st_no_fil.sort_values(ascending=False)
sightings_ctry_no_fil = sightings_ctry_no_fil.sort_values(ascending=False)
# select the first 10 rows
sightings_no_fil = sightings_no_fil.head(10)
sightings_st_no_fil = sightings_st_no_fil.head(10)
sightings_ctry_no_fil = sightings_ctry_no_fil.head(10)
# reset index to get a dataframe
sightings_no_fil = sightings_no_fil.reset_index(name='count')
sightings_st_no_fil = sightings_st_no_fil.reset_index(name='count')
sightings_ctry_no_fil = sightings_ctry_no_fil.reset_index(name='count')
# add 1 to the new index to make list start at 1
sightings_no_fil.index = sightings_no_fil.index + 1
sightings_st_no_fil.index = sightings_st_no_fil.index + 1
sightings_ctry_no_fil.index = sightings_ctry_no_fil.index + 1
#print("Top Cities of UFO Sightings (no hoax filter)\n", sightings_no_fil)
#print("\nTop States of UFO Sightings (no hoax filter)\n", sightings_st_no_fil)
#print("\nTop Countries of UFO Sightings (no hoax filter)\n", sightings_ctry_no_fil)
# Supplemental Question: Top 10 US Locations and States of Highest Counts of UFO Sightings WITHOUT hoax filter
# Get the top 10 US locations with UFO sightings
compare_df = ufo_df.copy() # Need to make copy of ufo_df because this block may alter the actual ufo_df dataframe
# filter the dataframe by country not equal to USA
compare_df = compare_df.loc[compare_df["country"] == "USA"]
# group by location and count sightings
sightings_us_no_fil = compare_df.groupby(["city"]).size()
sightings_us_st_no_fil = compare_df.groupby(["state"]).size()
# sort by count in descending order
sightings_us_no_fil = sightings_us_no_fil.sort_values(ascending=False)
sightings_us_st_no_fil = sightings_us_st_no_fil.sort_values(ascending=False)
# select the first 10 rows
sightings_us_no_fil = sightings_us_no_fil.head(10)
sightings_us_st_no_fil = sightings_us_st_no_fil.head(10)
# reset index to get a dataframe
sightings_us_no_fil = sightings_us_no_fil.reset_index(name='count')
sightings_us_st_no_fil = sightings_us_st_no_fil.reset_index(name='count')
# add 1 to the new index to make list start at 1
sightings_us_no_fil.index = sightings_us_no_fil.index + 1
sightings_us_st_no_fil.index = sightings_us_st_no_fil.index + 1
#print("Top US Cities with UFO Sightings without hoax filter\n", sightings_us_no_fil)
#print("\nTop US States with UFO Sightings without hoax filter\n", sightings_us_st_no_fil)
# Convert the text within the summary and text columns to lowercase
ufo_df["summary"].str.lower()
ufo_df["text"].str.lower()
# Print total count of rows with the words "hoax", "fake", "joke", or "prank" in either column
print(ufo_df.loc[ufo_df["summary"].str.lower().str.contains("hoax|fake|joke|prank", na=False)
| ufo_df["text"].str.lower().str.contains("hoax|fake|joke|prank", na=False)].shape[0])
3241
## count of Rows with the words "hoax", "fake", "joke", or "prank" in summary and text columns: 3241
# extract the rows that contain those words in either column
ufo_df.loc[ufo_df["summary"].str.lower().str.contains("hoax|fake|joke|prank")
| ufo_df["text"].str.lower().str.contains("hoax|fake|joke|prank")]
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | ((HOAX??)) On Google earth, found underwater ... | USA | NaN | CA | NaN | NaN | NaN | Occurred : Reported: 7/24/2019 1:45:33 PM 13... | I was on Google earth and found a underwater b... | NaN | NaN |
| 45 | Low flying huge disc spotted in Indianapolis, ... | USA | Indianapolis | IN | NaN | disk | 5 minutes | Occurred : 6/15/1963 22:00 (Entered as : 06/1... | Low flying huge disc spotted in Indianapolis, ... | 39.804642 | -86.137737 |
| 82 | Orange craft moving silently and fading in and... | USA | Westborough | MA | 2000-03-31T20:30:00 | triangle | 30-45 seconds | Occurred : 3/31/2000 20:30 (Entered as : 0331... | Orange craft moving silently and fading in and... | 42.264700 | -71.617500 |
| 229 | Orangy/Amber orbs of light up close | USA | Rachel | NV | 2019-09-20T23:56:00 | sphere | 5 - 10 min. | Occurred : 9/20/2019 23:56 (Entered as : 09/2... | Orangy/Amber orbs of light up close \n \nMe an... | NaN | NaN |
| 284 | ((HOAX??)) Me and my friend were out walking ... | USA | Marengo | IN | 2019-10-04T00:00:00 | light | NaN | Occurred : 10/4/2019 (Entered as : 11/15/19)... | Me and my friend were out walking the hunting ... | 38.377600 | -86.357200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 140936 | ((HOAX??)) Looked to be changing attitude | USA | Weatogue | CT | 2022-05-01T14:43:00 | unknown | 3 minutes | Occurred : 5/1/2022 14:43 (Entered as : 05/01... | Looked to be changing attitude \n \nPossible D... | 41.904467 | -72.993000 |
| 141119 | IN DANGER Actual footage of UFO's and Alien on... | USA | Lumberton | MS | 2022-04-19T23:56:00 | triangle | All the time | Occurred : 4/19/2022 23:56 (Entered as : 04/1... | IN DANGER Actual footage of UFO's and Alien on... | 30.997900 | -89.454700 |
| 141124 | ((HOAX)) alien | USA | Daytona Beach | FL | 2022-04-21T15:30:00 | triangle | 5 seconds | Occurred : 4/21/2022 15:30 (Entered as : 04/2... | My Girlfriend doesnt love me and she thinks im... | 29.201564 | -81.043778 |
| 141174 | ((HOAX)) 11111 | Unknown | Grayson | nan | 2022-05-09T17:10:00 | NaN | Spam | Occurred : 5/9/2022 17:10 (Entered as : 05/09... | 11111 \n \nSpam | NaN | NaN |
| 141258 | A very small white light hovering above the cl... | USA | Batavia | IL | 2013-04-30T02:19:00 | circle | 45 seconds | Occurred : 4/30/2013 02:19 (Entered as : 04/3... | A very small white light hovering above the cl... | 41.847800 | -88.307500 |
3241 rows × 11 columns
# get total count of rows with suspect words
ufo_df.loc[ufo_df["summary"].str.lower().str.contains("hoax|fake|joke|prank", na=False)].shape[0]
1490
# get total count of rows with "suspect words"
ufo_df.loc[ufo_df["text"].str.lower().str.contains("hoax|fake|joke|prank", na=False)].shape[0]
2459
ufo_df.loc[ufo_df["summary"].str.lower().str.contains("hoax|fake", na=False)].shape[0]
1458
ufo_df.loc[ufo_df["text"].str.lower().str.contains("hoax|fake", na=False)].shape[0]
1552
# extract the rows that contain "hoax" that are found only in summary column
ufo_df.loc[ufo_df["summary"].str.lower().str.contains("hoax|fake", na=False)]
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | ((HOAX??)) On Google earth, found underwater ... | USA | NaN | CA | NaN | NaN | NaN | Occurred : Reported: 7/24/2019 1:45:33 PM 13... | I was on Google earth and found a underwater b... | NaN | NaN |
| 284 | ((HOAX??)) Me and my friend were out walking ... | USA | Marengo | IN | 2019-10-04T00:00:00 | light | NaN | Occurred : 10/4/2019 (Entered as : 11/15/19)... | Me and my friend were out walking the hunting ... | 38.377600 | -86.357200 |
| 546 | Bright white thing I saw. It was ther then it ... | USA | Springdale | AR | 2019-10-12T03:00:00 | circle | 10min | Occurred : 10/12/2019 03:00 (Entered as : 10/... | Bright white thing I saw. It was ther then it... | 36.178214 | -94.103714 |
| 1218 | ((HOAX??))The TV was changing to the same char... | USA | Vancouver | WA | 2008-02-18T21:00:00 | other | 3 days | Occurred : 2/18/2008 21:00 (Entered as : 02/1... | The TV was changing to the same character's th... | 45.658570 | -122.590147 |
| 2146 | ((Hoax?)) Dark behind the cloud | USA | Oxford | nan | 2022-10-24T02:00:00 | circle | 2 min | Occurred : 10/24/2022 02:00 (Entered as : 10/... | Dark behind the cloud \n \nOh my! | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 138111 | I saw a bright object resembling a plane with ... | USA | Marshall | TX | 2022-07-14T20:27:00 | other | 2 minutes | Occurred : 7/14/2022 20:27 (Entered as : 07/1... | I saw a bright object resembling a plane with ... | 32.523547 | -94.336744 |
| 140719 | ((HOAX)) UFO on a star with men in red armore... | USA | Austin | TX | 2021-12-01T08:00:00 | changing | 5 years | Occurred : 12/1/2021 08:00 (Entered as : 12/0... | UFO on a star with men in red armored uniforms... | 30.316702 | -97.757991 |
| 140936 | ((HOAX??)) Looked to be changing attitude | USA | Weatogue | CT | 2022-05-01T14:43:00 | unknown | 3 minutes | Occurred : 5/1/2022 14:43 (Entered as : 05/01... | Looked to be changing attitude \n \nPossible D... | 41.904467 | -72.993000 |
| 141124 | ((HOAX)) alien | USA | Daytona Beach | FL | 2022-04-21T15:30:00 | triangle | 5 seconds | Occurred : 4/21/2022 15:30 (Entered as : 04/2... | My Girlfriend doesnt love me and she thinks im... | 29.201564 | -81.043778 |
| 141174 | ((HOAX)) 11111 | Unknown | Grayson | nan | 2022-05-09T17:10:00 | NaN | Spam | Occurred : 5/9/2022 17:10 (Entered as : 05/09... | 11111 \n \nSpam | NaN | NaN |
1458 rows × 11 columns
# extract the rows that contain "hoax" that are found only in text column
ufo_df.loc[ufo_df["text"].str.lower().str.contains("hoax|fake", na=False)]
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 284 | ((HOAX??)) Me and my friend were out walking ... | USA | Marengo | IN | 2019-10-04T00:00:00 | light | NaN | Occurred : 10/4/2019 (Entered as : 11/15/19)... | Me and my friend were out walking the hunting ... | 38.377600 | -86.357200 |
| 546 | Bright white thing I saw. It was ther then it ... | USA | Springdale | AR | 2019-10-12T03:00:00 | circle | 10min | Occurred : 10/12/2019 03:00 (Entered as : 10/... | Bright white thing I saw. It was ther then it... | 36.178214 | -94.103714 |
| 561 | 25-30 white pulsating objects with red glow mo... | USA | University Place | WA | 2019-10-12T20:19:00 | egg | 12 minutes | Occurred : 10/12/2019 20:19 (Entered as : 10/... | 25-30 white pulsating objects with red glow mo... | 47.201300 | -122.571100 |
| 1438 | You already have a partial report from a truck... | USA | Metropolis | IL | 2019-08-30T21:00:00 | circle | 5+ minutes | Occurred : 8/30/2019 21:00 (Entered as : 0830... | You already have a partial report from a truck... | 37.177200 | -88.723000 |
| 2001 | A brown beetle shaped craft flew over Harris C... | USA | Houston | TX | 2022-10-22T21:36:00 | oval | 15 seconds | Occurred : 10/22/2022 21:36 (Entered as : 10/... | A brown beetle shaped craft flew over Harris C... | 29.782477 | -95.440624 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 139691 | Observed an object/phenomena that morphed, pul... | USA | Las Vegas | NV | 2022-03-13T13:14:00 | changing | 33 minutes | Occurred : 3/13/2022 13:14 (Entered as : 03/1... | Observed an object/phenomena that morphed, pul... | 36.145976 | -115.184575 |
| 139979 | I know what my sister saw. She knows what I saw. | USA | Homer Glen | IL | 1993-02-18T02:30:00 | disk | Don't remember | Occurred : 2/18/1993 02:30 (Entered as : 02/1... | I know what my sister saw. She knows what I sa... | 41.601700 | -87.955700 |
| 139993 | Disk U.F.O. | USA | Moore | OK | 2018-04-19T16:00:00 | disk | 5 minutes | Occurred : 4/19/2018 16:00 (Entered as : 04/1... | Disk U.F.O. \n \nThis is the 3rd report. My ap... | NaN | NaN |
| 140223 | The object was not noticed at the moment of ta... | Uruguay | Piriapolis | MALDONADO DEPARTMENT | 2021-11-15T12:00:00 | other | NaN | Occurred : 11/15/2021 12:00 (Entered as : 11/... | The object was not noticed at the moment of ta... | NaN | NaN |
| 140829 | I saw 3 blinking lights then it disappears. | USA | Staunton | VA | 2022-02-28T19:58:00 | light | 2 seconds | Occurred : 2/28/2022 19:58 (Entered as : 02/2... | I saw 3 blinking lights then it disappears. \n... | 38.155100 | -79.060800 |
1552 rows × 11 columns
# search for total count of rows with "hoax" in summary column
ufo_df.loc[ufo_df["summary"].str.lower().str.contains("hoax", na=False)].shape[0]
1438
ufo_df.loc[284, "text"]
'Me and my friend were out walking the hunting ground we herd something up in the sky their was to bright lights looked like a plane at first but it was standing still and then it just took off when we looked around and confused \n \n \n((NUFORC Note: One of two probable hoaxes from same source. Date is flawed. PD))'
# search for total count of rows with "hoax" in text column
ufo_df.loc[ufo_df["text"].str.lower().str.contains("hoax", na=False)].shape[0]
1397
ufo_df.loc[16, "summary"]
'((HOAX??)) On Google earth, found underwater base with three large doors and a Tictac shaped UFO. ((anonymous report))'
ufo_df.loc[16, "text"]
'I was on Google earth and found a underwater base with three large doors and a Tictac shaped UFO sitting on top of the three large doors base is off the coast of la California it is sitting in a mountain range. \n \n \n((NUFORC Note: Witness elects to remain totally anonymous; provides no contact information. PD))'
ufo_df.loc[1218, "summary"]
"((HOAX??))The TV was changing to the same character's the radio stations I had a scratch on my forehead and a circle on my finger"
ufo_df.loc[1218, "text"]
"The TV was changing to the same character's the radio stations I had a scratch on my forehead and a circle on my finger I could hear \n \nI had a circle on my finger and a scratch on my forehead and my tailbone Hurt and the tv would only have the same characters on it one was black and I could hear the electronics the radio would only play three different stations pictures on my phone would have wierd stuff in them it smelled like burnt wires I could hear something in the sky but couldn't see it shortly after the expiereance a helicopter followed me around I could see 3 different kinda orbs the next morning our dog died"
ufo_df.loc[140719, "summary"]
'((HOAX)) UFO on a star with men in red armored uniforms about 60-100'
ufo_df.loc[140719, "text"]
'UFO on a star with men in red armored uniforms about 60-100 \n \nUFO flew over zapping a sting ray hologram made of light with a light beam and the stingray with tusks disappeared along with the large saucer right after seeing a grey alien detained by the men in red uniforms 1 strong one shoved him into a compartment he looked like a little grey boy 7 years old he was looking at the ground smiling and the man approached him from behind. I was out of body very reAl and very much realistic the way the uniforms looked it was all too real to deny it happened…for fact'
ufo_df.loc[138754, "summary"]
'You need to look threw the videos very closely this in not a joke at all contact me when you do'
ufo_df.loc[138754, "text"]
'You need to look threw the videos very closely this in not a joke at all contact me when you do \n \nSo much to explain in text and i have more videos and pictures u guys need to get back to me fr I been trying to report this for over 2 years on an off an been giving up because no one wants to contact me this is serious shit for real get back to me'
ufo_df.loc[141124, "summary"]
'((HOAX)) alien'
ufo_df.loc[141124, "text"]
'My Girlfriend doesnt love me and she thinks im gay she looks like an alien and flew away in a dorito shaped craft and went to minecraft land'
ufo_df.loc[2001, "summary"]
"A brown beetle shaped craft flew over Harris County Texas under cloud cover on the first night of Jupiter's shining, October 22, 2022."
ufo_df.loc[2001, "text"]
"A brown beetle shaped craft flew over Harris County Texas under cloud cover on the first night of Jupiter's shining, October 22, 2022. \n \nOn the night of October 22nd, 2022 the skies had a few clouds and the night sky was not so dark. I was standing outside the carport and looked up to my left and saw: \n \nA brown colored craft whose backend was shaped like the Thorax of a beetle. Half of the front end could not be seen as A MILE OR MORE HIGH, this craft seemed to be elongated and using clouds for cover. I did manage to see what looked like the front right of the craft had a circular fan vortex looking spinners. Like for hovering . There were no lights, so now I know all the shiny UFO stuff is a hoax. Im still pretty scared and Im a navy veteran.This craft looked very eerie and I now know that when I die, WE ARE NOT ALONE. Whoever is reading this, words cannot describe actually seeing something that looks like it come off the chronicles of riddick."
Hoax filter description: Drop the rows of data with the words "hoax" in the summary column, which will be 1438 rows/reports. The text column has 1397 rows with the word "hoax," which may or may not be associated with the appearance of the word "hoax" in the summary column. I believe we should allow this word in the text column because people may be using 'hoax" to explain the event and not using it to say it is a hoax.
This removal is mildly selective and not too encompassing because dropping 1438 rows containg "hoax" is better than dropping 3241 rows containing suspect words ("hoax", "fake", "joke", or "prank") in both the summary or text columns. I had to make this judgment call to save as much data as possible while removing the obvious hoax data.
original_rows = ufo_df.shape[0]
print("original number of rows: ", original_rows)
original number of rows: 141115
# get the total rows before filtering hoaxes
original_rows = ufo_df.shape[0]
print("Count of rows before hoax filter: ", original_rows)
# convert the text within the summary and text columns to lowercase
ufo_df["summary"].str.lower()
ufo_df["text"].str.lower()
# filter_hoaxes takes a dataframe argument and deletes the rows in the summary column containing "hoax"
def filter_hoaxes(df):
df = df.drop(ufo_df[df["summary"].str.lower().str.contains("hoax", na=False)].index)
modified_rows = df.shape[0]
deleted_rows = original_rows - modified_rows
print("Count of rows after hoax filter applied to dataset: ", modified_rows)
print("Deleted rows: ", deleted_rows)
return df
ufo_df = filter_hoaxes(ufo_df)
Count of rows before hoax filter: 141115 Count of rows after hoax filter applied to dataset: 139677 Deleted rows: 1438
ufo_df.shape[0] # current dataframe size
139677
# extract_date_time() extracts date time data from the string argument
def extract_date_time(string):
# use a regular expression to find any pattern that looks like a valid date or time
matches = re.search(r"(\d{1,2})/(\d{1,2})/(\d{2,4}) | (\d{1,2})/(\d{1,2})/(\d{2,4}) (\d{1,2}):(\d{1,2}):(\d{1,2}) ([AP]M?)?", string)
# get the current date for year checks below
today = datetime.today()
# get the current year
current_year = today.year
if matches:
# extract the month, day, year, hour, minute, second, and AM/PM from the match object
month = int(matches.group(1)) if matches.group(1) else 0
day = int(matches.group(2)) if matches.group(2) else 0
year = int(matches.group(3)) if matches.group(3) else 0
hour = int(matches.group(4)) if matches.group(4) else 0
minute = int(matches.group(5)) if matches.group(5) else 0
second = int(matches.group(6)) if matches.group(6) else 0
am_pm = matches.group(7) if matches.group(7) else 0
# check for months with 31 days: Jan, Mar, May, July, Aug, Oct, Dec
if month == 1 or month == 3 or month == 5 or month == 7 or month == 8 or month == 10 or month == 12:
if day < 32 and day > 0:
# convert the year to four digits if needed
if year < 100:
# if year is two digits and is greater than current year's two digits
if year > (current_year - 2000):
year += 1900
else:
year += 2000
# convert to 24-hour format if needed
if am_pm == "PM" and hour < 12:
hour += 12
elif am_pm == "AM" and hour == 12:
hour = 0
# make a datetime object from extracted values
dt = datetime(year, month, day, hour, minute, second)
# format datetime in proper military format
result = dt.strftime("%Y-%m-%d %H:%M:%S")
return result
else:
return None
# check for months with 30 days: Apr, June, Sep, Nov
elif month == 4 or month == 6 or month == 9 or month == 11:
if day < 31 and day > 0:
if year < 100:
if year > (current_year - 2000):
year += 1900
else:
year += 2000
if am_pm == "PM" and hour < 12:
hour += 12
elif am_pm == "AM" and hour == 12:
hour = 0
dt = datetime(year, month, day, hour, minute, second)
result = dt.strftime("%Y-%m-%d %H:%M:%S")
return result
else:
return None
# check for Feb
elif month == 2:
if day < 30 and day > 0:
if year < 100:
if year > (current_year - 2000):
year += 1900
else:
year += 2000
if am_pm == "PM" and hour < 12:
hour += 12
elif am_pm == "AM" and hour == 12:
hour = 0
dt = datetime(year, month, day, hour, minute, second)
result = dt.strftime("%Y-%m-%d %H:%M:%S")
return result
else:
return None
else:
return None
else:
return None
# update_date_time() takes a dataframe argument; retrieves the date time data from the stats column and updates the date_time column
def update_date_time(df):
# check if the date_time column has any missing values, then proceed
if df["date_time"].isnull().any():
# apply the extract_date_time function to the stats column and assign the result to the date_time column
df["date_time"] = df["stats"].apply(extract_date_time)
return df
# extract_date_time_dev() extracts date time data from string argument with deviant data
def extract_date_time_dev(string):
# use a regular expression to find any pattern that looks like a valid date or time
matches = re.search(r"(\d{1,2})/(\d{1,2})/(\d{2,4}) (\d{1,2}):(\d{1,2}):(\d{1,2}) ([AP]M?)? | (\d{1,2})/(\d{1,2})/(\d{2,4}) (\d{1,2}):(\d{1,2})(?:(\d{1,2}))? | (\d{1,2})/(\d{1,2})/(\d{2,4}) (\d{1,2}):(\d{1,2}) ", string)
# overcast matches to str data
matches = str(matches)
# use re.search() to find the "re.Match ..." pattern in the string
match = re.search(r"<re.Match object; span=\(\d+, \d+\), match='(.+)'>", matches)
# use match.group() to get the matched text
text = match.group(1)
# use strip() and slicing operator to remove leading spaces
text = text[text.find(text.strip()):]
# use re.search() to find date time pattern that still has not been matched
matches = re.search(r"(\d{1,2})/(\d{1,2})/(\d{2,4}) (\d{1,2}):(\d{1,2}):(\d{1,2}) ([AP]M?)?", text)
# use re.search() one final time to find date time pattern that still has not been matched
if (matches is None):
matches = re.search(r"(\d{1,2})/(\d{1,2})/(\d{2,4}) (\d{1,2}):(\d{1,2})", text)
# get the current date for year checks below
today = datetime.today()
# get the current year
current_year = today.year
if matches:
# extract the month, day, year, hour, and minute from the match object
month = int(matches.group(1)) if matches.group(1) else 0
day = int(matches.group(2)) if matches.group(2) else 0
year = int(matches.group(3)) if matches.group(3) else 0
hour = int(matches.group(4)) if matches.group(4) else 0
minute = int(matches.group(5)) if matches.group(5) else 0
# check for months with 31 days: Jan, Mar, May, July, Aug, Oct, Dec
if month == 1 or month == 3 or month == 5 or month == 7 or month == 8 or month == 10 or month == 12:
if day < 32 and day > 0:
# convert the year to four digits if needed
if year < 100:
# if year is two digits and is greater than current year's two digits
if year > (current_year - 2000):
year += 1900
else:
year += 2000
# make a datetime object from extracted values
dt = datetime(year, month, day, hour, minute)
# format datetime in military format
result = dt.strftime("%Y-%m-%d %H:%M")
return result
else:
return None
# check for months with 30 days: Apr, June, Sep, Nov
elif month == 4 or month == 6 or month == 9 or month == 11:
if day < 31 and day > 0:
if year < 100:
if year > (current_year - 2000):
year += 1900
else:
year += 2000
dt = datetime(year, month, day, hour, minute)
result = dt.strftime("%Y-%m-%d %H:%M")
return result
else:
return None
# check for Feb
elif month == 2:
if day < 30 and day > 0:
if year < 100:
if year > (current_year - 2000):
year += 1900
else:
year += 2000
dt = datetime(year, month, day, hour, minute)
result = dt.strftime("%Y-%m-%d %H:%M")
return result
else:
return None
else:
return None
else:
return None
# update_date_time_dev() takes a dataframe argument; retrieves the date time data from the stats column and updates the date_time column
def update_date_time_dev(df):
# check if the date_time column has any missing values, then proceed
if df["date_time"].isnull().any():
# apply the extract_date_time_dev function to the stats column and assign the result to the date_time column
df["date_time"] = df["stats"].apply(extract_date_time_dev)
return df
# create a mask for rows with missing values for the date_time column
mask = ufo_df[["date_time"]].isnull().all(axis=1)
# call update_date_time() with the mask
ufo_df[mask] = update_date_time(ufo_df[mask])
# check the date_time column for further missing values: should be 211 rows of missing values
# make a mask with the missing values for the date_time column
mask = ufo_df[["date_time"]].isnull().all(axis=1)
# get the entire row of data for those rows
ufo_df[mask]
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | There was a air craft that appeared triangular... | USA | Lake Elsinore | CA | None | triangle | NaN | Occurred : Reported: 7/28/2019 3:15:56 AM 03... | There was a air craft that appeared triangular... | 33.664501 | -117.341532 |
| 41 | It look like it was a shooting star but it was... | India | Kolkata | nan | None | fireball | 10 minutes | Occurred : 20:08 (Entered as : 20:08) Report... | It look like it was a shooting star but it was... | NaN | NaN |
| 1877 | Large Amber orb with pulsating light. | USA | NaN | AZ | None | NaN | 45 minutes | Occurred : Reported: 8/15/2019 7:51:35 AM 07... | Large Amber orb with pulsating light. \n \nI w... | NaN | NaN |
| 1878 | Several Abductions | USA | Jonesville | MI | None | other | NaN | Occurred : Reported: 7/28/2019 8:42:56 PM 20... | I was abducted several times all through my ch... | 41.983800 | -84.644900 |
| 1879 | It was exactly the same as the night before ..... | USA | Grand Terrace | CA | None | cylinder | 2 minutes | Occurred : Reported: 8/11/2019 2:01:11 PM 14... | It was exactly the same as the night before ..... | 34.029500 | -117.310000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 133147 | Glowing red and blue Orb ((NUFORC Note: Witn... | USA | North Carolina | NC | None | flash | NaN | Occurred : Reported: 4/20/2021 7:20:59 PM 19... | Glowing red and blue Orb \n \nAt first there w... | NaN | NaN |
| 133148 | Multiple lights ( 13 -16 ) in formation trave... | USA | Scotland | TX | None | formation | 4 minutes | Occurred : 21:03 (Entered as : 04072021 21:0... | Multiple lights ( 13 -16 ) in formation trave... | 33.654500 | -98.461600 |
| 134728 | Voids in space - energy beam | USA | Town Creek | AL | None | other | 2 seconds | Occurred : 22:00 (Entered as : 22:00) Report... | Voids in space - energy beam \n \nOne night a ... | 34.653200 | -87.426100 |
| 136091 | look if been filming for almost two years to s... | USA | Canyon City | CO | None | other | 45 minutes | Occurred : Reported: 8/3/2020 10:52:12 PM 22... | look if been filming for almost two years to s... | NaN | NaN |
| 136382 | Let w flying triangular craft..comp silence..f... | Canada | Kitchener | ON | None | triangle | NaN | Occurred : Reported: 7/4/2020 12:28:03 PM 12... | Let w flying triangular craft..comp silence..f... | 43.445905 | -80.470033 |
211 rows × 11 columns
## call update_date_time_dev()
# create mask with the missing values for the date_time column
mask = ufo_df[["date_time"]].isnull().all(axis=1)
# get the date time data for these rows with missing values in the date_time column
ufo_df[mask] = update_date_time_dev(ufo_df[mask])
# create mask with the missing values for the date_time column
mask = ufo_df[["date_time"]].isnull().all(axis=1)
# get the entire row of data for those rows
print("count of rows: ", ufo_df[mask].shape[0])
ufo_df[mask]
count of rows: 0
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude |
|---|
# verify that date_time column has no missing values; legit data rows have date_time column filled
# make a mask with the missing values for date_time column
mask = ufo_df[["date_time"]].isnull().all(axis=1)
# get the entire row of data for those rows
ufo_df[mask]
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude |
|---|
# print the working dataframe's rows
print(ufo_df.shape[0])
139677
# fix_datetime takes dataframe and checks for years; errors for rows 76632, 122963, 118625
# NOTE: This function takes a while to run (about 3-4 minutes)
# NOTE: Range of pandas datetime objects is between 1677-09-21 and 2262-04-11
def fix_datetime_oob_error(df):
# convert the date_time column to a datetime format with errors='ignore'
df["date_time"] = pd.to_datetime(df["date_time"], errors="ignore")
# extract the year from the date_time column using strftime method
df["year"] = df["date_time"].apply(lambda x: x.strftime("%Y") if isinstance(x, pd.Timestamp) else x)
# filter the dataframe by the year greater than 1677;
df = df.loc[df["year"] > "1677"]
# drop the year column as it is no longer needed
df = df.drop("year", axis=1)
return df
ufo_df = fix_datetime_oob_error(ufo_df)
ufo_df.shape[0]
139674
import pytz
# convert_timestamp converts a datetime value to timestamped format (epoch time)
def convert_timestamp(datetime_value):
if not isinstance(datetime_value, dtime.datetime):
temp_value = str(datetime_value)
# search for a match to the date time pattern from temp_value
matches = re.search("(\d{1,2})-(\d{1,2})-(\d{2,4})T(\d{1,2}):(\d{1,2}):(\d{1,2})", temp_value)
if matches:
# parse the datetime_value and formats the values into a datetime object
datetime_value = dtime.datetime.strptime(datetime_value, "%Y-%m-%dT%H:%M:%S")
else:
# no match, so search for another date time pattern
matches = re.search("(\d{1,2})-(\d{1,2})-(\d{2,4})\s*(\d{1,2}):(\d{1,2}):(\d{1,2})", temp_value)
if matches:
# parse the datetime_value and formats the values into a datetime object
datetime_value = dtime.datetime.strptime(datetime_value, "%Y-%m-%d %H:%M:%S")
else:
# no match, so search for yet another date time pattern
matches = re.search("(\d{1,2})-(\d{1,2})-(\d{2,4})\s*(\d{1,2}):(\d{1,2})", temp_value)
if matches:
# parse the datetime_value and formats the values into a datetime object
datetime_value = dtime.datetime.strptime(datetime_value, "%Y-%m-%d %H:%M")
else:
# unknown format so raise ValueError
raise ValueError(datetime_value, ": datetime value is unknown.")
# get the local UTC timezone for datetime_value
datetime_value = pytz.utc.localize(datetime_value)
return datetime_value.timestamp()
# convert_datetime_column applies the convert_timestamp function to datetime column and assigns
# the transformed values to a new column called transformed_datetime
def convert_datetime_column(df):
df["date_time_epoch"] = df["date_time"].apply(convert_timestamp)
return df
# calls the convert_datetime_column for the ufo_df as an argument
ufo_df = convert_datetime_column(ufo_df)
ufo_df
| summary | country | city | state | date_time | shape | duration | stats | text | city_latitude | city_longitude | date_time_epoch | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MADAR Node 100 | USA | Mountlake Terrace | WA | 2019-06-23T18:53:00 | NaN | NaN | Occurred : 6/23/2019 18:53 (Entered as : 06/2... | MADAR Node 100 | 47.794100 | -122.306600 | 1.561316e+09 |
| 1 | Steady flashing object with three lights hover... | USA | Hamden | CT | 2019-06-23T20:00:00 | light | 5 hours | Occurred : 6/23/2019 20:00 (Entered as : 6/23... | Steady flashing object with three lights hover... | 41.373941 | -72.921325 | 1.561320e+09 |
| 2 | Group of several orange lights, seemingly circ... | USA | Charlottesville | VA | 2019-06-20T23:28:00 | circle | 15 seconds | Occurred : 6/20/2019 23:28 (Entered as : 06/2... | Group of several orange lights, seemingly circ... | 38.055968 | -78.494482 | 1.561073e+09 |
| 3 | Dropped in flashed a few times and shot off 5 ... | USA | Lincoln Park | MI | 2019-06-21T00:00:00 | light | 2 minutes | Occurred : 6/21/2019 00:00 (Entered as : 06/2... | Dropped in flashed a few times and shot off 5 ... | 42.238500 | -83.178300 | 1.561075e+09 |
| 4 | Location: While traveling in a TGV, from Lille... | France | Douai | nan | 2019-06-07T20:00:00 | cigar | 5 minutes | Occurred : 6/7/2019 20:00 (Entered as : 06/07... | Location: While traveling in a TGV, from Lill... | NaN | NaN | 1.559938e+09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 141256 | See craft every evening when I get out of work... | USA | Branford | CT | 2021-12-19T23:00:00 | circle | NaN | Occurred : 12/19/2021 23:00 (Entered as : 12/... | See craft every evening when I get out of work... | 41.277600 | -72.811500 | 1.639955e+09 |
| 141257 | MADAR Node 70 | USA | St Louis | MO | 2022-03-09T00:48:00 | NaN | NaN | Occurred : 3/9/2022 00:48 (Entered as : 03/09... | MADAR Node 70 \n \n | 38.626989 | -90.307029 | 1.646787e+09 |
| 141258 | A very small white light hovering above the cl... | USA | Batavia | IL | 2013-04-30T02:19:00 | circle | 45 seconds | Occurred : 4/30/2013 02:19 (Entered as : 04/3... | A very small white light hovering above the cl... | 41.847800 | -88.307500 | 1.367288e+09 |
| 141259 | I was young. You know what? It was pretty fuck... | USA | Liberty Lake | WA | 1995-07-21T09:00:00 | circle | 1/12 | Occurred : 7/21/1995 09:00 (Entered as : 07/2... | I was young. You know what? It was pretty ((... | 47.651700 | -117.083800 | 8.063172e+08 |
| 141260 | While driving at night, I watched two blue-gre... | USA | Fredericksburg | VA | 2019-12-24T18:30:00 | sphere | 4 seconds | Occurred : 12/24/2019 18:30 (Entered as : 012... | While driving at night, I watched two blue-gre... | 38.299511 | -77.500674 | 1.577212e+09 |
139674 rows × 12 columns
# convert the date_time index to epoch timestamps in seconds, for verificatiion
#ufo_df["date_time_epoch2"] = pd.to_datetime(ufo_df["date_time"]).astype(np.int64) / 10**9
#ufo_df
ufo_df = ufo_df.drop(columns=["stats","text"])
ufo_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 139674 entries, 0 to 141260 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 summary 139602 non-null object 1 country 136603 non-null object 2 city 139363 non-null object 3 state 136603 non-null object 4 date_time 139674 non-null object 5 shape 133554 non-null object 6 duration 132298 non-null object 7 city_latitude 114352 non-null float64 8 city_longitude 114352 non-null float64 9 date_time_epoch 139674 non-null float64 dtypes: float64(3), object(7) memory usage: 11.7+ MB
# convert the duration values to lowercase
ufo_df["duration"] = ufo_df["duration"].str.lower()
# fix_duration() takes a string and returns a string object with the time value and time unit
def fix_duration(s):
# convert s to a string
s = str(s)
# searches for a pattern with a valid time value and time unit from string s
matches = re.search(r"(\d+(?:\.\d+)?)\s*(-)?\s*(\d+(?:\.\d+)?)?\s*(:)?(\d{1,2})?\s*(s|sec?|m|min|h|hr|ho)?", s)
if matches:
# get the numeric value and the unit from the match object
value1 = matches.group(1)
dash = matches.group(2)
value2 = matches.group(3)
colon = matches.group(4)
value3 = int(matches.group(5)) if matches.group(5) else 0 # for "dd:dd" data
unit = matches.group(6)
# cast value1 and value2 to float for rounding purpose
value1 = float(value1)
if value2:
value2 = float(value2)
# round value1 and value2 up
value1 = round(value1)
if value2:
value2 = round(value2)
# for values like "12:34" would return as "12:34:00" rather than None
if value2 is None and colon == ":" and unit is None:
if len(str(value3)) == 2 or len(str(value3)) != 1:
time_val = "0 days " + str(value1) + ":" + str(value3) + ":00"
elif len(str(value3)) == 1:
time_val = "0 days " + str(value1) + ":0" + str(value3) + ":00"
return time_val
# check the unit and return the time value and unit as a strings object
if (unit == "m") or (unit == "min"):
# if there is a value2, add it to the value1
if value2:
# if dash is present need to average the values for minutes
if dash:
# get the average of the values and round up by use of the math.ceil function
avg = math.ceil((value1 + value2) / 2)
# find the length of avg to place one or no zeroes before it
if len(str(avg)) == 2 or len(str(avg)) != 1:
time_val = "0 days 00:" + str(avg) + ":00"
elif len(str(avg)) == 1:
time_val = "0 days 00:0" + str(avg) + ":00"
# else no dash means addition of value1 and value2
else:
sum = value1 + value2
# find the length of sum to place one or no zeroes before it
if len(str(sum)) == 2 or len(str(sum)) != 1:
time_val = "0 days 00:" + str(sum) + ":00"
elif len(str(sum)) == 1:
time_val = "0 days 00:0" + str(sum) + ":00"
# else only value1 is viable
else:
# find the length of value1 to place one or no zeroes before it
if len(str(value1)) == 2 or len(str(value1)) != 1:
time_val = "0 days 00:" + str(value1) + ":00"
elif len(str(value1)) == 1:
time_val = "0 days 00:0" + str(value1) + ":00"
return time_val
elif (unit == "s") or (unit == "sec"):
# if there is a value2, add it to the value1
if value2:
# if dash is present need to average the values for seconds
if dash:
# get the average of the values and round up
avg = math.ceil((value1 + value2) / 2)
# find the length of avg to place one or no zeroes before it
if len(str(avg)) == 2 or len(str(avg)) != 1:
time_val = "0 days 00:00:" + str(avg)
elif len(str(avg)) == 1:
time_val = "0 days 00:00:0" + str(avg)
# else no dash means addition of value1 and value2
else:
sum = value1 + value2
# find the length of sum to place one or no zeroes before it
if len(str(sum)) == 2 or len(str(sum)) != 1:
time_val = "0 days 00:00:" + str(sum)
elif len(str(sum)) == 1:
time_val = "0 days 00:00:0" + str(sum)
# else only value1 is viable
else:
# find the length of value1 to place one or no zeroes before it
if len(str(value1)) == 2 or len(str(value1)) != 1:
time_val = "0 days 00:00:" + str(value1)
elif len(str(value1)) == 1:
time_val = "0 days 00:00:0" + str(value1)
return time_val
elif (unit == "h") or (unit == "hr") or (unit == "ho"):
# if there is a value2, add it to the value1
if value2:
# if dash is present need to average the values for hours
if dash:
# get the average of the values and round up
avg = math.ceil((value1 + value2) / 2)
time_val = "0 days " + str(avg) + ":00:00"
# else no dash means addition of value1 and value2
else:
# no dash means a simple addition of two numbers
time_val = "0 days " + str(value1 + value2) + ":00:00"
# else only value1 is viable
else:
# get row 18317 error: "116 hwy. 25" and make its time value None
if (value1 == 116) and (value3 == 0):
return None
else:
time_val = "0 days " + str(value1) + ":00:00"
return time_val
else:
return None
else:
return None
# apply fix_duration to each duration column value and assign it back to the dataframe
ufo_df["duration"] = ufo_df["duration"].apply(fix_duration)
# apply pd.to_timedelta function to each duration column value and return a series of timedeltas
ufo_df["duration"] = ufo_df["duration"].apply(pd.to_timedelta)
#ufo_df
ufo_df: dataframe that has been error-fixed for city, state, country, data_time, and location columns (EXCEPT for duration column), the missing values imputed for the date_time column, and hoax-filtered
Number of rows: 139674
ufo_reset_df: copy of ufo_df made for reindexing the dataframe to date_time
duration_df : the latest form of ufo_df that has been error fixed for all the columns that ufo_df has been fixed (INCLUDING duration column), the missing values imputed for the date_time column, and hoax-filtered
Number of rows: 116714
compare_df: made for temporary uses and subsetting ufo_df or ufo_reset_df
# copy original ufo_df to ufo_reset_df
ufo_reset_df = ufo_df.copy()
# reset the index to date_time
ufo_reset_df = ufo_reset_df.set_index("date_time")
# sort in ascending order (chronological order)
ufo_reset_df = ufo_reset_df.sort_index()
type(ufo_reset_df.index)
pandas.core.indexes.base.Index
ufo_reset_df.head()
| summary | country | city | state | shape | duration | city_latitude | city_longitude | date_time_epoch | |
|---|---|---|---|---|---|---|---|---|---|
| date_time | |||||||||
| 1762-12-11 00:00:00 | Reported in a London paper in 1762: a bright l... | None | Lulworth | None | NaN | 0 days 00:01:00 | NaN | NaN | -6.534086e+09 |
| 1860-08-02 00:00:00 | meteor or fireball passage | USA | Cherokee | NC | fireball | NaT | 35.509200 | -83.315700 | -3.452803e+09 |
| 1861-03-29 00:00:00 | Strange, cross-shaped, object witnessed over N... | USA | New York | NY | cross | NaT | 40.752564 | -73.984457 | -3.432154e+09 |
| 1864-05-16 00:00:00 | This is a UFO report by a respected Confederat... | USA | Cave Spring | GA | light | NaT | 34.116500 | -85.335000 | -3.333312e+09 |
| 1864-05-16 00:00:00 | In 1864, prior to airplanes, a white object wa... | USA | Cave Springs | GA | unknown | NaT | NaN | NaN | -3.333312e+09 |
# apply log transformation to the date_time_epoch column
ufo_reset_df["date_time_log"] = np.log(ufo_reset_df["date_time_epoch"])
# reverse the log transformation to get back the original epoch timestamps
ufo_reset_df["date_time_epoch_inv"] = np.exp(ufo_reset_df["date_time_log"])
# convert the epoch timestamps to human-readable datetime type
ufo_reset_df["date_time_inv"] = pd.to_datetime(ufo_reset_df["date_time_epoch_inv"], unit="s")
# Print the first 5 rows of the dataframe
#print(ufo_reset_df.head(30))
ufo_reset_df.shape[0]
139674
# make line graphs of datetime and normalized datetime (logged transformed) by count
compare_df = ufo_reset_df.copy()
# get the datetime from the index of compare_df
compare_df["date"] = pd.to_datetime(compare_df.index)
# group by date values
untran_datetime = compare_df["date"].groupby(compare_df["date"].dt.date).size()
# sort by count in descending order
untran_datetime = untran_datetime.sort_values(ascending=False)
# reset index to get a dataframe
untran_datetime = untran_datetime.reset_index(name="count")
# reverse the log transformation values and convert the date_time_epoch values to a datetime type for plotting
compare_df["date_log"] = pd.to_datetime(np.exp(compare_df["date_time_log"]), unit="s")
# group by date_log values
log_datetime = compare_df["date_log"].groupby(compare_df["date_log"].dt.date).size()
# sort by count in descending order
log_datetime = log_datetime.sort_values(ascending=False)
# reset index to get a dataframe
log_datetime = log_datetime.reset_index(name="count")
# print the resulting dataframe's total count of rows
print("Total rows of data: ", compare_df.shape[0])
# plot graphs of original datetime and normalized datetime columns
untran_datetime.plot(x="date", xlabel="datetime", y="count", ylabel="count", kind="line", legend=True)
plt.title("Original Datetime Values of UFO Sightings")
plt.xlabel("Date_time")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("plots/datetime_distrib_line.png")
plt.show()
# normalized datetime graph
log_datetime.plot(x="date_log", xlabel="datetime", y="count", ylabel="count", kind="line", legend=True)
plt.title("Normalized Datetime Values (log-transformed) of UFO Sightings")
plt.xlabel("Normalized Date_time")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
Total rows of data: 139674
normal_df_dur = ufo_reset_df # reference the duration_df
# convert the timedelta column, duration, to a numerical column in seconds and store values in a
# new column called duration_seconds
normal_df_dur["duration_seconds"] = normal_df_dur["duration"].dt.total_seconds()
# create an instance of MinMaxScaler
scaler = MinMaxScaler()
# fit and transform the duration_seconds column and assign values to a new column "duration_norm"
normal_df_dur["duration_norm"] = scaler.fit_transform(normal_df_dur[["duration_seconds"]])
# print the resulting dataframe's total count of rows
print("Total rows of data: ", normal_df_dur.shape[0])
# plot histograms of original and normalized duration_seconds columns
plt.figure(figsize=(10, 5))
# plot duration times
plt.subplot(1, 2, 1)
plt.hist(normal_df_dur["duration_seconds"], bins=20)
plt.title("Original duration_seconds")
plt.xlim(0, 1.25e6)
plt.xlabel("Seconds")
plt.ylabel("Frequency")
# plot normalized duration times
plt.subplot(1, 2, 2)
plt.hist(normal_df_dur["duration_norm"], bins=20)
plt.title("Normalized duration_seconds using MinMaxScaler")
plt.xlabel("Normalized seconds")
plt.ylabel("Frequency")
plt.show()
Total rows of data: 139674
ufo_reset_df.shape[0]
139674
# create a line graph for duration time
compare_df = ufo_reset_df.copy() # copy duration_df in case we alter it
# get the time in minutes by getting total_seconds and dividing by 60
time_min = compare_df["duration"].dt.total_seconds() / 60
# convert to a dataFrame with column name duration_norm
time_min = time_min.to_frame(name="duration")
# group by the duration column
time_min = time_min.groupby("duration").size()
# plot graph of duration min by counts for all countries
time_min.plot(x="duration", y="count", kind="line", figsize=(6,4))
plt.title("Duration (min) by Count of UAP Sightings in All Countries")
plt.xlabel("Duration (minutes)")
plt.ylabel("Count")
plt.xlim(0, 100) # set the xlim to year
plt.savefig("plots/duration_distrib_line.png")
plt.show()
# create a line graph for normalized duration time
compare_df = normal_df_dur.copy() # copy normal_df_dur
# convert the duration_normalized column to a timedelta type
compare_df["duration_norm"] = pd.to_timedelta(compare_df["duration_norm"])
# get the time in minutes by getting total_seconds and dividing by 60
time_min = compare_df["duration_norm"].dt.total_seconds() / 60
# convert to a dataFrame with column name duration_norm
time_min = time_min.to_frame(name="duration_norm")
# group by duration column
time_min = time_min.groupby("duration_norm").size()
# plot graph of duration min by counts for all countries
time_min.plot(x="duration_norm", y="count", kind="line", figsize=(6,4))
plt.title("Normalized Duration (min) by Count of UAP Sightings\nin All Countries: normalized data")
plt.xlabel("Duration (minutes)")
plt.ylabel("Count")
plt.savefig("plots/norm_duration_distrib_line.png")
plt.show()
#ufo_reset_df
# save ufo_reset_df dataframe as a csv file
#ufo_reset_df.to_csv("data/ufo_reset_df.csv")
Example:
If we have Allentown, PA, we would get frequencies by counting the number of reports of sightings over years between 2009 and 2012.
The dataframe has for year 2009, 3 observations --> freq = 3
for year 2010, 1 observation --> freq = 1
for year 2011, 1 observation --> freq = 1
for year 2012, 8 observations --> freq = 8
UFO Dataframe (Relevant Columns)
Date_time Year City Country Count
2009-02-15 23:00:00 2009 Allentown USA 3
2009-03-06 20:45:00 2009 Allentown USA 3
2009-03-22 23:54:00 2009 Allentown USA 3
2010-10-02 20:00:00 2010 Allentown USA 1
2011-10-20 07:30:00 2011 Allentown USA 1
2012-01-15 23:20:00 2012 Allentown USA 8
2012-04-05 17:45:00 2012 Allentown USA 8
2012-06-30 02:30:00 2012 Allentown USA 8
2012-07-06 22:40:00 2012 Allentown USA 8
2012-08-25 21:00:00 2012 Allentown USA 8
2012-10-17 20:45:00 2012 Allentown USA 8
2012-11-19 15:00:00 2012 Allentown USA 8
2012-11-21 17:45:00 2012 Allentown USA 8# calculate count and frequency and assign it to new columns "countx" and freq"
# convert the date_time column to datetime format
ufo_reset_df["date_time"] = pd.to_datetime(ufo_reset_df.index)
# get the year and month from the date_time column
ufo_reset_df["year"] = ufo_reset_df["date_time"].dt.year
# calculate the frequencies of the group columns by selected features
ufo_reset_df["count_col"] = ufo_reset_df.groupby(["year","city","country"])["year"].transform("count")
# calculate the probability of a ufo sighting
ufo_reset_df["freq"] = (ufo_reset_df["count_col"] / (len(ufo_reset_df))) * 100
# get the sum of all freq; not sure if I need this
sum_freq = ufo_reset_df["freq"].sum()
len(ufo_reset_df)
139674
# store sum of all frequencies for use later
sum_freq = ufo_reset_df["freq"].sum()
sum_freq
470.0080186720505
# get different feature groupings and display the frequencies for each group
freq_df = ufo_reset_df[["year","city","country","freq"]]
# drop na values from freq_vals
freq_df = freq_df.dropna()
# print the frequencies and title of table
print("** Note: Showing only frequencies last 10 rows for each dataframe")
print("\nFrequency by Year, City, and Country\n", freq_df.tail(10).to_string(index=False))
# make a line graph of frequency by year for UFO sightings
freq_df.plot(x="year", xlabel="year", y="freq", ylabel="frequency", kind="line", legend=True, title="Frequency by Year, City, Country of UFO Sightings")
plt.legend(loc="upper left")
plt.xlim(1950, 2023) # set the xlim to earliest year of reports to 2022
plt.tight_layout()
# save as png file
plt.savefig("plots/comp_freq_line.png")
plt.show()
** Note: Showing only frequencies last 10 rows for each dataframe Frequency by Year, City, and Country year city country freq 2022 Tucson USA 0.012171 2022 Washington Dc USA 0.002864 2022 Mentor Headlands USA 0.000716 2022 East Greenwich USA 0.001432 2022 Franklin USA 0.007875 2022 Essex Canada 0.000716 2022 Columbus USA 0.014319 2022 Johannesburg South Africa 0.002148 2022 Newark USA 0.002864 2022 Ahmedabad Unknown 0.000716
# plot the histogram of frequency values by year, city, and country
# get the freq and year values place in series objects
freq_vals = freq_df[["freq"]]
# drop na values from freq_vals
freq_vals = freq_vals.dropna()
# make a histogram with freq_vals
plt.rcParams["figure.figsize"] = (6, 4)
plt.hist(freq_vals, label="freq", bins=20)
# set title and formatting
plt.title("Histogram of Frequencies of UFO Sightings")
plt.legend(loc="upper right")
plt.xlim(0.000,0.03)
plt.xticks([0.000,0.001,0.003,0.005,0.007,0.008,0.01,0.015,0.02,0.025,0.03,0.035], rotation=90)
plt.xlabel("Frequency")
plt.ylabel("Count")
plt.tight_layout()
# save as png file
plt.savefig("plots/freq_hist.png")
plt.show()
#len(ufo_reset_df)
#### In case we use the log transformed datetime and need to revert the datetime to readable format ####
#### Reverse the timestamped normalized values #####
# inverse_normalize reverses normalized datetime values with max_value and min_value for int64
def inverse_normalize(date_value, min_value, max_value):
return date_value * (max_value - min_value) + min_value
# load data
ufo_data = ufo_reset_df.copy()
# strip the spaces from the column names
ufo_data.columns = ufo_data.columns.str.strip()
# encode categorical features
le = LabelEncoder()
# encode the location features
le.fit(ufo_data["city"])
encoded_city = le.transform(ufo_data["city"])
decoded_city = le.inverse_transform(encoded_city) # For inversing the LabelEncoder later
ufo_data["city"] = le.transform(ufo_data["city"])
le.fit(ufo_data["state"])
encoded_state = le.transform(ufo_data["state"])
decoded_state = le.inverse_transform(encoded_state) # For inversing the LabelEncoder
ufo_data["state"] = le.transform(ufo_data["state"])
le.fit(ufo_data["country"])
encoded_ctry = le.transform(ufo_data["country"])
decoded_ctry = le.inverse_transform(encoded_ctry) # For inversing the LabelEncoder
ufo_data["country"] = le.transform(ufo_data["country"])
le.fit(ufo_data["shape"])
encoded_sh = le.transform(ufo_data["shape"])
decoded_sh = le.inverse_transform(encoded_sh) # For inversing the LabelEncoder
ufo_data["shape"] = le.transform(ufo_data["shape"])
# get the year and month
ufo_data["year"] = pd.DatetimeIndex(ufo_data.index).year
ufo_data["month"] = pd.DatetimeIndex(ufo_data.index).month
# for duration, it has been normalized already as ufo_reset_df["duration_normalized"]
print("ufo_data[duration_norm]: ", type(ufo_data["duration_norm"].dtypes))
ufo_data[duration_norm]: <class 'numpy.dtype[float64]'>
# inversing the LabelEncoder(): display the encoded and decoded location values
print("encoded city = ", encoded_city)
print("decoded city = ", decoded_city)
print("\nencoded state = ", encoded_state)
print("decoded state = ", decoded_state)
print("\nencoded country = ", encoded_ctry)
print("decoded country = ", decoded_ctry)
print("\nencoded shape = ", encoded_sh)
print("decoded shape = ", decoded_sh)
encoded city = [11002 3330 13169 ... 9076 13180 144] decoded city = ['Lulworth' 'Cherokee' 'New York' ... 'Johannesburg' 'Newark' 'Ahmedabad'] encoded state = [211 119 136 ... 68 58 209] decoded state = [None 'NC' 'NY' ... 'GAUTENG' 'DE' 'nan'] encoded country = [325 301 301 ... 254 301 308] decoded country = [None 'USA' 'USA' ... 'South Africa' 'USA' 'Unknown'] encoded shape = [23 11 5 ... 21 14 22] decoded shape = [nan 'fireball' 'cross' ... 'triangle' 'light' 'unknown']
# use GridSearchCV to find the best parameters for HistGradientBoostingRegressor
#### WARNING: this block take a bit long to run (3 - 4 min)
# make the features dataframe X
X = pd.DataFrame(ufo_data[["year","month","city","state","country","shape","duration_norm"]])
X = X.dropna()
# make the target variable y
y = X.groupby(["year","month","city","state","country","shape","duration_norm"])["year"].transform("count")
# calculate the freq of these groupings
y = (y / len(X)) * 100
# choose the parameters and settings via a dictionary
parameters = {
"loss":["squared_error","absolute_error","gamma","poisson"],
"learning_rate":[0.005,0.1,0.5,0.8],
"max_iter":[100,125,150],
"max_leaf_nodes":[28,31,35,40],
}
# create an instance of GridSearchCV with GradientBoostingRegressor as estimator
grid = GridSearchCV(HistGradientBoostingRegressor(),parameters,scoring="r2",cv=5,n_jobs=-1)
# fit the grid search and display the best parameters' scores
grid.fit(X,y)
print(grid.best_params_,grid.best_score_)
{'learning_rate': 0.005, 'loss': 'squared_error', 'max_iter': 100, 'max_leaf_nodes': 35} -0.0008889433914372802
# use HistGradientBoostingRegressor to find highest scoring features
# make the features dataframe X
X = pd.DataFrame(ufo_data[["year","month","city","state","country","shape","duration_norm"]])
X = X.dropna()
# make the target variable y
y = X.groupby(["year","month","city","state","country","shape","duration_norm"])["year"].transform("count")
# calculate the freq of this grouping
y = (y / len(X)) * 100
# create an instance of the model
hgbr = HistGradientBoostingRegressor(loss="squared_error", learning_rate=0.005, max_iter=125, max_leaf_nodes=31)
# fit the classifier on the dataset
hgbr.fit(X, y)
# get the permutation feature importances
result = permutation_importance(hgbr,X,y,scoring="r2",n_repeats=10,n_jobs=-1)
importances = result.importances_mean
# get the feature names
names = X.columns
# sort the features by their importances in descending order
indices = np.argsort(importances)[::-1]
print("Score ranking:")
# create empty list for features to be selected
selected_features =[]
for i in range(X.shape[1]):
print(i + 1, ". Feature ", indices[i], " ", importances[indices[i]])
# add name to the selected_features list
selected_features.append(names[indices[i]])
# get the top 4 features
top_features_names = selected_features[:4]
print("\nHistGradientBoostingRegressor selected features: ", top_features_names)
Score ranking: 1 . Feature 2 0.010265428202056504 2 . Feature 0 0.00764948891460353 3 . Feature 3 0.007370510861590629 4 . Feature 1 0.005041313475362297 5 . Feature 6 0.0012306300842096717 6 . Feature 5 0.0001573050213683369 7 . Feature 4 1.3172080073775571e-05 HistGradientBoostingRegressor selected features: ['city', 'year', 'state', 'month']
# find best parameter settings for SelectKBest selector
# make the features dataframe X
X = pd.DataFrame(ufo_data[["year","month","city","state","country","shape","duration_norm"]])
X = X.dropna()
# make the target variable y
y = X.groupby(["year","month","city","state","country","shape","duration_norm"])["year"].transform("count")
# calculate the freq of this grouping
y = (y / len(X)) * 100
# make an instance of selector
selector = SelectKBest()
# choose the parameters for the selector
parameters = {
"k":[2,3,4,5,6,7],
"score_func":[r_regression, f_classif, chi2, f_regression]
}
# instantiate GridSearchCV with the selector, scoring method, and cv=5
cv = GridSearchCV(selector, parameters, scoring="neg_mean_squared_error", cv=5)
# fit the training and target set and flatten y array with ravel function
cv.fit(X,y.values.ravel())
# display the results
print("Best parameters for the SelectKBest are: ", cv.best_params_)
Best parameters for the SelectKBest are: {'k': 2, 'score_func': <function r_regression at 0x0000019DFFE18F70>}
# use the same X features and y target that we did for the SVR model
# make the features dataframe X
X = pd.DataFrame(ufo_data[["year","month","city","state","country","shape","duration_norm"]])
X = X.dropna()
# make the target variable y
y = X.groupby(["year","month","city","state","country","shape","duration_norm"])["year"].transform("count")
# calculate the freq of this grouping
y = (y / len(X)) * 100
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# make an instance of the selector with r_regression as the scoring function and k=3
selector = SelectKBest(r_regression, k=3)
# fit the selector on the training data
selector.fit(X_train, y_train)
# get the scores of the features
scores = selector.scores_
print("Scores:", scores)
# get the selected features
selected_features = X.columns[selector.get_support()]
print("\nSelectKBest's selected features:", selected_features)
Scores: [ 0.01371672 0.00046502 0.01140011 -0.01914446 0.0161268 0.00182053 -0.00187334] SelectKBest's selected features: Index(['year', 'city', 'country'], dtype='object')
# remove outliers with years < 1930
# copy the ufo_rest_df to compare_df
compare_df = ufo_reset_df.copy()
# get the year from dataframe and subset it with year >= 1930
compare_df["year"] = pd.DatetimeIndex(ufo_data.index).year
compare_df = compare_df.loc[compare_df["year"] >= 1930]
# subset X into feature and the freq columns
X = compare_df[["year","city","country","freq"]]
# drop na values
X = X.dropna()
X = X[["year","freq"]]
# split data into X_train set, having years < 2022, and X_test set, having year == 2022
X_train = X.loc[X["year"] < 2021]
X_test = X.loc[X["year"] >= 2021]
# copy X_test to actual_freq for analysis later
actual_freq = X_test.copy()
# create the "ds" and "y" columns for the dataframe
# make date_time index into datetime type and put into a date column
X_train["date"] = pd.to_datetime(X_train.index)
# make a new df with date and freq columns
X_train = X_train[["date", "freq"]]
# drop duplicate rows
X_train = X_train.drop_duplicates()
# rename to expected column names for Prophet
X_train.columns = ["ds", "y"]
# rename freq to y for calculating metrics later
X_test = X_test.rename(columns={"freq": "y"})
# rename to expected column names for Prophet for cross validation step below
X.columns = ["ds", "y"]
# create and fit the additive model
m_add = Prophet()
m_add.add_seasonality(name="monthly", period=30.5, fourier_order=5)
m_add.fit(X_train)
# create and fit the multiplicative model
m_mult = Prophet(seasonality_mode="multiplicative")
m_mult.add_seasonality(name="monthly", period=30.5, fourier_order=5)
m_mult.fit(X_train)
# set future dataframe to 24 periods to compare X-_test with the predictions
future = m_add.make_future_dataframe(periods=24, freq="MS")
# make predictions
forecast_add = m_add.predict(future)
forecast_mult = m_mult.predict(future)
23:51:57 - cmdstanpy - INFO - Chain [1] start processing 23:52:53 - cmdstanpy - INFO - Chain [1] done processing 23:53:14 - cmdstanpy - INFO - Chain [1] start processing 23:54:05 - cmdstanpy - INFO - Chain [1] done processing
# find the better Prophet model
from prophet import Prophet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# create and fit an additive model
m_add = Prophet()
m_add.add_seasonality(name="monthly", period=30.5, fourier_order=5)
m_add.fit(X_train)
# create and fit a multiplicative model
m_mult = Prophet(seasonality_mode="multiplicative")
m_mult.add_seasonality(name="monthly", period=30.5, fourier_order=5)
m_mult.fit(X_train)
# set future dataframe to 24 periods to compare X-_test with the predictions
future = m_add.make_future_dataframe(24, freq="MS")
# make predictions
forecast_add = m_add.predict(future)
forecast_mult = m_mult.predict(future)
# slice forecast_add and forecast_mult by the rows that match length of X_test
forecast_add_sl = forecast_add[-len(X_test):]
forecast_mult_sl = forecast_mult[-len(X_test):]
# calculate MAE, MSE, R2 scores for the models
mae_add = mean_absolute_error(X_test["y"], forecast_add_sl["yhat"])
mse_add = mean_squared_error(X_test["y"], forecast_add_sl["yhat"])
r2_add = r2_score(X_test["y"], forecast_add_sl["yhat"])
mae_mult = mean_absolute_error(X_test["y"], forecast_mult_sl["yhat"])
mse_mult = mean_squared_error(X_test["y"], forecast_mult_sl["yhat"])
r2_mult = r2_score(X_test["y"], forecast_mult_sl["yhat"])
# print results for both models
print("Metric Scores for Additive Model:")
print("MSE: ", round(mse_add,2))
print("MAE: ", round(mae_add,2))
print("R2: ", round(r2_add,2))
print("Metric Scores for Multiplicative Model:")
print("MSE: ", round(mse_mult,2))
print("MAE: ", round(mae_mult,2))
print("R2: ", round(r2_mult,2))
23:55:14 - cmdstanpy - INFO - Chain [1] start processing 23:56:10 - cmdstanpy - INFO - Chain [1] done processing 23:56:32 - cmdstanpy - INFO - Chain [1] start processing 23:57:22 - cmdstanpy - INFO - Chain [1] done processing
Metric Scores for Additive Model: MSE: 0.0 MAE: 0.0 R2: -0.14 Metric Scores for Multiplicative Model: MSE: 0.0 MAE: 0.0 R2: -0.15
# make a graph for the additive model
fig_add = m_add.plot(forecast_add)
fig_add.show()
# make a graph for the multiplicative model
fig_mul = m_mult.plot(forecast_mult)
fig_mul.show()
# define the model with changepoint_prior_scale = 0.3
model = Prophet(changepoint_prior_scale=0.03)
# add seasonality to the model
model.add_seasonality(name="monthly", period=30.5, fourier_order=5)
# fit the model
model.fit(X_train)
23:59:54 - cmdstanpy - INFO - Chain [1] start processing 00:00:59 - cmdstanpy - INFO - Chain [1] done processing
<prophet.forecaster.Prophet at 0x19d9b870a00>
# cross-validate with these parameters settings; these settings were the better ones from earlier runs
df_cv = cross_validation(model, initial='730 days', period='180 days', horizon='365 days')
0%| | 0/172 [00:00<?, ?it/s]
00:01:43 - cmdstanpy - INFO - Chain [1] start processing 00:02:11 - cmdstanpy - INFO - Chain [1] done processing 00:02:12 - cmdstanpy - INFO - Chain [1] start processing 00:02:13 - cmdstanpy - INFO - Chain [1] done processing 00:02:13 - cmdstanpy - INFO - Chain [1] start processing 00:02:44 - cmdstanpy - INFO - Chain [1] done processing 00:02:45 - cmdstanpy - INFO - Chain [1] start processing 00:02:45 - cmdstanpy - INFO - Chain [1] done processing 00:02:45 - cmdstanpy - INFO - Chain [1] start processing 00:02:45 - cmdstanpy - INFO - Chain [1] done processing 00:02:46 - cmdstanpy - INFO - Chain [1] start processing 00:02:46 - cmdstanpy - INFO - Chain [1] done processing 00:02:46 - cmdstanpy - INFO - Chain [1] start processing 00:02:46 - cmdstanpy - INFO - Chain [1] done processing 00:02:47 - cmdstanpy - INFO - Chain [1] start processing 00:02:47 - cmdstanpy - INFO - Chain [1] done processing 00:02:47 - cmdstanpy - INFO - Chain [1] start processing 00:02:47 - cmdstanpy - INFO - Chain [1] done processing 00:02:48 - cmdstanpy - INFO - Chain [1] start processing 00:02:48 - cmdstanpy - INFO - Chain [1] done processing 00:02:48 - cmdstanpy - INFO - Chain [1] start processing 00:02:49 - cmdstanpy - INFO - Chain [1] done processing 00:02:49 - cmdstanpy - INFO - Chain [1] start processing 00:02:49 - cmdstanpy - INFO - Chain [1] done processing 00:02:49 - cmdstanpy - INFO - Chain [1] start processing 00:02:49 - cmdstanpy - INFO - Chain [1] done processing 00:02:50 - cmdstanpy - INFO - Chain [1] start processing 00:02:50 - cmdstanpy - INFO - Chain [1] done processing 00:02:50 - cmdstanpy - INFO - Chain [1] start processing 00:02:50 - cmdstanpy - INFO - Chain [1] done processing 00:02:50 - cmdstanpy - INFO - Chain [1] start processing 00:02:50 - cmdstanpy - INFO - Chain [1] done processing 00:02:50 - cmdstanpy - INFO - Chain [1] start processing 00:02:51 - cmdstanpy - INFO - Chain [1] done processing 00:02:51 - cmdstanpy - INFO - Chain [1] start processing 00:02:51 - cmdstanpy - INFO - Chain [1] done processing 00:02:51 - cmdstanpy - INFO - Chain [1] start processing 00:02:51 - cmdstanpy - INFO - Chain [1] done processing 00:02:51 - cmdstanpy - INFO - Chain [1] start processing 00:02:52 - cmdstanpy - INFO - Chain [1] done processing 00:02:52 - cmdstanpy - INFO - Chain [1] start processing 00:02:52 - cmdstanpy - INFO - Chain [1] done processing 00:02:52 - cmdstanpy - INFO - Chain [1] start processing 00:02:52 - cmdstanpy - INFO - Chain [1] done processing 00:02:52 - cmdstanpy - INFO - Chain [1] start processing 00:02:52 - cmdstanpy - INFO - Chain [1] done processing 00:02:53 - cmdstanpy - INFO - Chain [1] start processing 00:02:53 - cmdstanpy - INFO - Chain [1] done processing 00:02:53 - cmdstanpy - INFO - Chain [1] start processing 00:02:53 - cmdstanpy - INFO - Chain [1] done processing 00:02:53 - cmdstanpy - INFO - Chain [1] start processing 00:02:54 - cmdstanpy - INFO - Chain [1] done processing 00:02:54 - cmdstanpy - INFO - Chain [1] start processing 00:02:54 - cmdstanpy - INFO - Chain [1] done processing 00:02:54 - cmdstanpy - INFO - Chain [1] start processing 00:02:54 - cmdstanpy - INFO - Chain [1] done processing 00:02:54 - cmdstanpy - INFO - Chain [1] start processing 00:02:55 - cmdstanpy - INFO - Chain [1] done processing 00:02:55 - cmdstanpy - INFO - Chain [1] start processing 00:02:55 - cmdstanpy - INFO - Chain [1] done processing 00:02:55 - cmdstanpy - INFO - Chain [1] start processing 00:02:55 - cmdstanpy - INFO - Chain [1] done processing 00:02:56 - cmdstanpy - INFO - Chain [1] start processing 00:02:56 - cmdstanpy - INFO - Chain [1] done processing 00:02:56 - cmdstanpy - INFO - Chain [1] start processing 00:02:56 - cmdstanpy - INFO - Chain [1] done processing 00:02:56 - cmdstanpy - INFO - Chain [1] start processing 00:02:56 - cmdstanpy - INFO - Chain [1] done processing 00:02:57 - cmdstanpy - INFO - Chain [1] start processing 00:02:57 - cmdstanpy - INFO - Chain [1] done processing 00:02:57 - cmdstanpy - INFO - Chain [1] start processing 00:02:57 - cmdstanpy - INFO - Chain [1] done processing 00:02:57 - cmdstanpy - INFO - Chain [1] start processing 00:02:58 - cmdstanpy - INFO - Chain [1] done processing 00:02:58 - cmdstanpy - INFO - Chain [1] start processing 00:02:58 - cmdstanpy - INFO - Chain [1] done processing 00:02:58 - cmdstanpy - INFO - Chain [1] start processing 00:02:58 - cmdstanpy - INFO - Chain [1] done processing 00:02:59 - cmdstanpy - INFO - Chain [1] start processing 00:02:59 - cmdstanpy - INFO - Chain [1] done processing 00:02:59 - cmdstanpy - INFO - Chain [1] start processing 00:02:59 - cmdstanpy - INFO - Chain [1] done processing 00:02:59 - cmdstanpy - INFO - Chain [1] start processing 00:03:00 - cmdstanpy - INFO - Chain [1] done processing 00:03:00 - cmdstanpy - INFO - Chain [1] start processing 00:03:00 - cmdstanpy - INFO - Chain [1] done processing 00:03:00 - cmdstanpy - INFO - Chain [1] start processing 00:03:00 - cmdstanpy - INFO - Chain [1] done processing 00:03:01 - cmdstanpy - INFO - Chain [1] start processing 00:03:01 - cmdstanpy - INFO - Chain [1] done processing 00:03:01 - cmdstanpy - INFO - Chain [1] start processing 00:03:01 - cmdstanpy - INFO - Chain [1] done processing 00:03:02 - cmdstanpy - INFO - Chain [1] start processing 00:03:02 - cmdstanpy - INFO - Chain [1] done processing 00:03:02 - cmdstanpy - INFO - Chain [1] start processing 00:03:02 - cmdstanpy - INFO - Chain [1] done processing 00:03:03 - cmdstanpy - INFO - Chain [1] start processing 00:03:03 - cmdstanpy - INFO - Chain [1] done processing 00:03:03 - cmdstanpy - INFO - Chain [1] start processing 00:03:03 - cmdstanpy - INFO - Chain [1] done processing 00:03:04 - cmdstanpy - INFO - Chain [1] start processing 00:03:04 - cmdstanpy - INFO - Chain [1] done processing 00:03:04 - cmdstanpy - INFO - Chain [1] start processing 00:03:04 - cmdstanpy - INFO - Chain [1] done processing 00:03:05 - cmdstanpy - INFO - Chain [1] start processing 00:03:05 - cmdstanpy - INFO - Chain [1] done processing 00:03:05 - cmdstanpy - INFO - Chain [1] start processing 00:03:05 - cmdstanpy - INFO - Chain [1] done processing 00:03:06 - cmdstanpy - INFO - Chain [1] start processing 00:03:06 - cmdstanpy - INFO - Chain [1] done processing 00:03:07 - cmdstanpy - INFO - Chain [1] start processing 00:03:07 - cmdstanpy - INFO - Chain [1] done processing 00:03:07 - cmdstanpy - INFO - Chain [1] start processing 00:03:07 - cmdstanpy - INFO - Chain [1] done processing 00:03:08 - cmdstanpy - INFO - Chain [1] start processing 00:03:08 - cmdstanpy - INFO - Chain [1] done processing 00:03:08 - cmdstanpy - INFO - Chain [1] start processing 00:03:09 - cmdstanpy - INFO - Chain [1] done processing 00:03:09 - cmdstanpy - INFO - Chain [1] start processing 00:03:09 - cmdstanpy - INFO - Chain [1] done processing 00:03:10 - cmdstanpy - INFO - Chain [1] start processing 00:03:10 - cmdstanpy - INFO - Chain [1] done processing 00:03:11 - cmdstanpy - INFO - Chain [1] start processing 00:03:11 - cmdstanpy - INFO - Chain [1] done processing 00:03:11 - cmdstanpy - INFO - Chain [1] start processing 00:03:11 - cmdstanpy - INFO - Chain [1] done processing 00:03:12 - cmdstanpy - INFO - Chain [1] start processing 00:03:12 - cmdstanpy - INFO - Chain [1] done processing 00:03:13 - cmdstanpy - INFO - Chain [1] start processing 00:03:13 - cmdstanpy - INFO - Chain [1] done processing 00:03:14 - cmdstanpy - INFO - Chain [1] start processing 00:03:14 - cmdstanpy - INFO - Chain [1] done processing 00:03:15 - cmdstanpy - INFO - Chain [1] start processing 00:03:15 - cmdstanpy - INFO - Chain [1] done processing 00:03:16 - cmdstanpy - INFO - Chain [1] start processing 00:03:16 - cmdstanpy - INFO - Chain [1] done processing 00:03:16 - cmdstanpy - INFO - Chain [1] start processing 00:03:17 - cmdstanpy - INFO - Chain [1] done processing 00:03:17 - cmdstanpy - INFO - Chain [1] start processing 00:03:18 - cmdstanpy - INFO - Chain [1] done processing 00:03:18 - cmdstanpy - INFO - Chain [1] start processing 00:03:19 - cmdstanpy - INFO - Chain [1] done processing 00:03:19 - cmdstanpy - INFO - Chain [1] start processing 00:03:20 - cmdstanpy - INFO - Chain [1] done processing 00:03:21 - cmdstanpy - INFO - Chain [1] start processing 00:03:21 - cmdstanpy - INFO - Chain [1] done processing 00:03:22 - cmdstanpy - INFO - Chain [1] start processing 00:03:22 - cmdstanpy - INFO - Chain [1] done processing 00:03:23 - cmdstanpy - INFO - Chain [1] start processing 00:03:23 - cmdstanpy - INFO - Chain [1] done processing 00:03:24 - cmdstanpy - INFO - Chain [1] start processing 00:03:24 - cmdstanpy - INFO - Chain [1] done processing 00:03:25 - cmdstanpy - INFO - Chain [1] start processing 00:03:26 - cmdstanpy - INFO - Chain [1] done processing 00:03:27 - cmdstanpy - INFO - Chain [1] start processing 00:03:27 - cmdstanpy - INFO - Chain [1] done processing 00:03:28 - cmdstanpy - INFO - Chain [1] start processing 00:03:28 - cmdstanpy - INFO - Chain [1] done processing 00:03:29 - cmdstanpy - INFO - Chain [1] start processing 00:03:29 - cmdstanpy - INFO - Chain [1] done processing 00:03:30 - cmdstanpy - INFO - Chain [1] start processing 00:03:31 - cmdstanpy - INFO - Chain [1] done processing 00:03:32 - cmdstanpy - INFO - Chain [1] start processing 00:03:32 - cmdstanpy - INFO - Chain [1] done processing 00:03:33 - cmdstanpy - INFO - Chain [1] start processing 00:03:34 - cmdstanpy - INFO - Chain [1] done processing 00:03:35 - cmdstanpy - INFO - Chain [1] start processing 00:03:35 - cmdstanpy - INFO - Chain [1] done processing 00:03:36 - cmdstanpy - INFO - Chain [1] start processing 00:03:37 - cmdstanpy - INFO - Chain [1] done processing 00:03:38 - cmdstanpy - INFO - Chain [1] start processing 00:03:38 - cmdstanpy - INFO - Chain [1] done processing 00:03:39 - cmdstanpy - INFO - Chain [1] start processing 00:03:40 - cmdstanpy - INFO - Chain [1] done processing 00:03:41 - cmdstanpy - INFO - Chain [1] start processing 00:03:42 - cmdstanpy - INFO - Chain [1] done processing 00:03:43 - cmdstanpy - INFO - Chain [1] start processing 00:03:44 - cmdstanpy - INFO - Chain [1] done processing 00:03:45 - cmdstanpy - INFO - Chain [1] start processing 00:03:45 - cmdstanpy - INFO - Chain [1] done processing 00:03:46 - cmdstanpy - INFO - Chain [1] start processing 00:03:47 - cmdstanpy - INFO - Chain [1] done processing 00:03:48 - cmdstanpy - INFO - Chain [1] start processing 00:03:49 - cmdstanpy - INFO - Chain [1] done processing 00:03:50 - cmdstanpy - INFO - Chain [1] start processing 00:03:51 - cmdstanpy - INFO - Chain [1] done processing 00:03:52 - cmdstanpy - INFO - Chain [1] start processing 00:03:52 - cmdstanpy - INFO - Chain [1] done processing 00:03:54 - cmdstanpy - INFO - Chain [1] start processing 00:03:54 - cmdstanpy - INFO - Chain [1] done processing 00:03:55 - cmdstanpy - INFO - Chain [1] start processing 00:03:56 - cmdstanpy - INFO - Chain [1] done processing 00:03:57 - cmdstanpy - INFO - Chain [1] start processing 00:03:58 - cmdstanpy - INFO - Chain [1] done processing 00:04:00 - cmdstanpy - INFO - Chain [1] start processing 00:04:00 - cmdstanpy - INFO - Chain [1] done processing 00:04:02 - cmdstanpy - INFO - Chain [1] start processing 00:04:02 - cmdstanpy - INFO - Chain [1] done processing 00:04:04 - cmdstanpy - INFO - Chain [1] start processing 00:04:04 - cmdstanpy - INFO - Chain [1] done processing 00:04:06 - cmdstanpy - INFO - Chain [1] start processing 00:04:07 - cmdstanpy - INFO - Chain [1] done processing 00:04:09 - cmdstanpy - INFO - Chain [1] start processing 00:04:09 - cmdstanpy - INFO - Chain [1] done processing 00:04:11 - cmdstanpy - INFO - Chain [1] start processing 00:04:12 - cmdstanpy - INFO - Chain [1] done processing 00:04:13 - cmdstanpy - INFO - Chain [1] start processing 00:04:14 - cmdstanpy - INFO - Chain [1] done processing 00:04:16 - cmdstanpy - INFO - Chain [1] start processing 00:04:17 - cmdstanpy - INFO - Chain [1] done processing 00:04:19 - cmdstanpy - INFO - Chain [1] start processing 00:04:20 - cmdstanpy - INFO - Chain [1] done processing 00:04:22 - cmdstanpy - INFO - Chain [1] start processing 00:04:23 - cmdstanpy - INFO - Chain [1] done processing 00:04:25 - cmdstanpy - INFO - Chain [1] start processing 00:04:27 - cmdstanpy - INFO - Chain [1] done processing 00:04:29 - cmdstanpy - INFO - Chain [1] start processing 00:04:31 - cmdstanpy - INFO - Chain [1] done processing 00:04:33 - cmdstanpy - INFO - Chain [1] start processing 00:04:35 - cmdstanpy - INFO - Chain [1] done processing 00:04:37 - cmdstanpy - INFO - Chain [1] start processing 00:04:39 - cmdstanpy - INFO - Chain [1] done processing 00:04:41 - cmdstanpy - INFO - Chain [1] start processing 00:04:42 - cmdstanpy - INFO - Chain [1] done processing 00:04:45 - cmdstanpy - INFO - Chain [1] start processing 00:04:47 - cmdstanpy - INFO - Chain [1] done processing 00:04:49 - cmdstanpy - INFO - Chain [1] start processing 00:04:52 - cmdstanpy - INFO - Chain [1] done processing 00:04:54 - cmdstanpy - INFO - Chain [1] start processing 00:04:56 - cmdstanpy - INFO - Chain [1] done processing 00:04:59 - cmdstanpy - INFO - Chain [1] start processing 00:05:02 - cmdstanpy - INFO - Chain [1] done processing 00:05:06 - cmdstanpy - INFO - Chain [1] start processing 00:05:08 - cmdstanpy - INFO - Chain [1] done processing 00:05:11 - cmdstanpy - INFO - Chain [1] start processing 00:05:14 - cmdstanpy - INFO - Chain [1] done processing 00:05:18 - cmdstanpy - INFO - Chain [1] start processing 00:05:20 - cmdstanpy - INFO - Chain [1] done processing 00:05:25 - cmdstanpy - INFO - Chain [1] start processing 00:05:28 - cmdstanpy - INFO - Chain [1] done processing 00:05:32 - cmdstanpy - INFO - Chain [1] start processing 00:05:36 - cmdstanpy - INFO - Chain [1] done processing 00:05:41 - cmdstanpy - INFO - Chain [1] start processing 00:05:44 - cmdstanpy - INFO - Chain [1] done processing 00:05:49 - cmdstanpy - INFO - Chain [1] start processing 00:05:52 - cmdstanpy - INFO - Chain [1] done processing 00:05:58 - cmdstanpy - INFO - Chain [1] start processing 00:06:03 - cmdstanpy - INFO - Chain [1] done processing 00:06:09 - cmdstanpy - INFO - Chain [1] start processing 00:06:12 - cmdstanpy - INFO - Chain [1] done processing 00:06:19 - cmdstanpy - INFO - Chain [1] start processing 00:06:26 - cmdstanpy - INFO - Chain [1] done processing 00:06:33 - cmdstanpy - INFO - Chain [1] start processing 00:06:39 - cmdstanpy - INFO - Chain [1] done processing 00:06:46 - cmdstanpy - INFO - Chain [1] start processing 00:06:51 - cmdstanpy - INFO - Chain [1] done processing 00:06:59 - cmdstanpy - INFO - Chain [1] start processing 00:07:05 - cmdstanpy - INFO - Chain [1] done processing 00:07:12 - cmdstanpy - INFO - Chain [1] start processing 00:07:22 - cmdstanpy - INFO - Chain [1] done processing 00:07:30 - cmdstanpy - INFO - Chain [1] start processing 00:07:44 - cmdstanpy - INFO - Chain [1] done processing 00:07:52 - cmdstanpy - INFO - Chain [1] start processing 00:08:04 - cmdstanpy - INFO - Chain [1] done processing 00:08:13 - cmdstanpy - INFO - Chain [1] start processing 00:08:21 - cmdstanpy - INFO - Chain [1] done processing 00:08:30 - cmdstanpy - INFO - Chain [1] start processing 00:08:37 - cmdstanpy - INFO - Chain [1] done processing 00:08:47 - cmdstanpy - INFO - Chain [1] start processing 00:08:58 - cmdstanpy - INFO - Chain [1] done processing 00:09:07 - cmdstanpy - INFO - Chain [1] start processing 00:09:18 - cmdstanpy - INFO - Chain [1] done processing 00:09:28 - cmdstanpy - INFO - Chain [1] start processing 00:09:39 - cmdstanpy - INFO - Chain [1] done processing 00:09:50 - cmdstanpy - INFO - Chain [1] start processing 00:10:03 - cmdstanpy - INFO - Chain [1] done processing 00:10:14 - cmdstanpy - INFO - Chain [1] start processing 00:10:28 - cmdstanpy - INFO - Chain [1] done processing 00:10:40 - cmdstanpy - INFO - Chain [1] start processing 00:10:51 - cmdstanpy - INFO - Chain [1] done processing 00:11:03 - cmdstanpy - INFO - Chain [1] start processing 00:11:16 - cmdstanpy - INFO - Chain [1] done processing 00:11:29 - cmdstanpy - INFO - Chain [1] start processing 00:11:48 - cmdstanpy - INFO - Chain [1] done processing 00:12:02 - cmdstanpy - INFO - Chain [1] start processing 00:12:24 - cmdstanpy - INFO - Chain [1] done processing 00:12:39 - cmdstanpy - INFO - Chain [1] start processing 00:13:00 - cmdstanpy - INFO - Chain [1] done processing 00:13:15 - cmdstanpy - INFO - Chain [1] start processing 00:13:38 - cmdstanpy - INFO - Chain [1] done processing 00:13:54 - cmdstanpy - INFO - Chain [1] start processing 00:14:20 - cmdstanpy - INFO - Chain [1] done processing 00:14:36 - cmdstanpy - INFO - Chain [1] start processing 00:15:07 - cmdstanpy - INFO - Chain [1] done processing 00:15:24 - cmdstanpy - INFO - Chain [1] start processing 00:15:48 - cmdstanpy - INFO - Chain [1] done processing 00:16:06 - cmdstanpy - INFO - Chain [1] start processing 00:16:35 - cmdstanpy - INFO - Chain [1] done processing 00:16:53 - cmdstanpy - INFO - Chain [1] start processing 00:17:25 - cmdstanpy - INFO - Chain [1] done processing 00:17:43 - cmdstanpy - INFO - Chain [1] start processing 00:18:24 - cmdstanpy - INFO - Chain [1] done processing 00:18:43 - cmdstanpy - INFO - Chain [1] start processing 00:19:43 - cmdstanpy - INFO - Chain [1] done processing 00:20:01 - cmdstanpy - INFO - Chain [1] start processing 00:20:54 - cmdstanpy - INFO - Chain [1] done processing 00:21:13 - cmdstanpy - INFO - Chain [1] start processing 00:21:36 - cmdstanpy - INFO - Chain [1] done processing 00:21:55 - cmdstanpy - INFO - Chain [1] start processing 00:22:32 - cmdstanpy - INFO - Chain [1] done processing 00:22:52 - cmdstanpy - INFO - Chain [1] start processing 00:23:17 - cmdstanpy - INFO - Chain [1] done processing 00:23:38 - cmdstanpy - INFO - Chain [1] start processing 00:24:20 - cmdstanpy - INFO - Chain [1] done processing 00:24:41 - cmdstanpy - INFO - Chain [1] start processing 00:25:37 - cmdstanpy - INFO - Chain [1] done processing
# get the performance_metrics for the dataframe
df_p = performance_metrics(df_cv)
df_p.head()
| horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
|---|---|---|---|---|---|---|---|---|
| 0 | 38 days 03:30:00 | 0.000025 | 0.004954 | 0.003043 | 1.814356 | 0.902635 | 0.836469 | 0.905959 |
| 1 | 38 days 03:59:00 | 0.000025 | 0.004954 | 0.003042 | 1.814461 | 0.902776 | 0.836473 | 0.905987 |
| 2 | 38 days 04:00:00 | 0.000025 | 0.004954 | 0.003042 | 1.814367 | 0.902635 | 0.836443 | 0.906001 |
| 3 | 38 days 04:10:00 | 0.000025 | 0.004954 | 0.003043 | 1.814285 | 0.902497 | 0.836420 | 0.906001 |
| 4 | 38 days 04:15:00 | 0.000025 | 0.004954 | 0.003043 | 1.814344 | 0.902497 | 0.836450 | 0.906001 |
df_cv.head()
| ds | yhat | yhat_lower | yhat_upper | y | cutoff | |
|---|---|---|---|---|---|---|
| 0 | 1934-09-15 | 0.000716 | 0.000716 | 0.000716 | 0.000716 | 1934-02-27 |
| 1 | 1934-09-15 | 0.000716 | 0.000716 | 0.000716 | 0.000716 | 1934-08-26 |
| 2 | 1935-06-15 | 0.000716 | 0.000716 | 0.000716 | 0.000716 | 1934-08-26 |
| 3 | 1935-06-15 | 0.000716 | 0.000716 | 0.000716 | 0.000716 | 1935-02-22 |
| 4 | 1936-07-15 | 0.000716 | 0.000716 | 0.000716 | 0.000716 | 1935-08-21 |
# Source: #3 above "Prophet Docs >> Diagnostics: Cross-validation"
# assign the param_grid with parameter values
param_grid = {
'changepoint_prior_scale': [0.001, 0.01, 0.1, 0.5],
'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0],
}
# generate all combinations of parameters via dictionary comprehension
all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
# assign empty rmse list to store each param
rmses = []
# assign cutoffs to six months apart
cutoffs = pd.to_datetime(['2019-02-15', '2019-08-15', '2020-02-15'])
# use cross validation to calculate rmse score for each parameter
for params in all_params:
# fit model with given params; X is the df
model = Prophet(**params).fit(X)
# cross validate with the cutoff dates
df_cv = cross_validation(model, cutoffs=cutoffs, horizon='365 days', parallel="processes")
df_p = performance_metrics(df_cv, rolling_window=1)
rmses.append(df_p['rmse'].values[0])
# find the best parameters by comparing RMSE score
tuning_results = pd.DataFrame(all_params)
tuning_results['rmse'] = rmses
print(tuning_results)
00:52:32 - cmdstanpy - INFO - Chain [1] start processing 00:52:52 - cmdstanpy - INFO - Chain [1] done processing 00:53:55 - cmdstanpy - INFO - Chain [1] start processing 00:54:06 - cmdstanpy - INFO - Chain [1] done processing 00:55:00 - cmdstanpy - INFO - Chain [1] start processing 00:55:19 - cmdstanpy - INFO - Chain [1] done processing 00:56:21 - cmdstanpy - INFO - Chain [1] start processing 00:56:31 - cmdstanpy - INFO - Chain [1] done processing 00:57:24 - cmdstanpy - INFO - Chain [1] start processing 00:57:46 - cmdstanpy - INFO - Chain [1] done processing 00:58:48 - cmdstanpy - INFO - Chain [1] start processing 00:59:02 - cmdstanpy - INFO - Chain [1] done processing 01:00:47 - cmdstanpy - INFO - Chain [1] start processing 01:01:06 - cmdstanpy - INFO - Chain [1] done processing 01:02:20 - cmdstanpy - INFO - Chain [1] start processing 01:02:44 - cmdstanpy - INFO - Chain [1] done processing 01:04:04 - cmdstanpy - INFO - Chain [1] start processing 01:04:40 - cmdstanpy - INFO - Chain [1] done processing 01:06:56 - cmdstanpy - INFO - Chain [1] start processing 01:08:04 - cmdstanpy - INFO - Chain [1] done processing 01:10:14 - cmdstanpy - INFO - Chain [1] start processing 01:10:55 - cmdstanpy - INFO - Chain [1] done processing 01:13:25 - cmdstanpy - INFO - Chain [1] start processing 01:14:19 - cmdstanpy - INFO - Chain [1] done processing 01:16:24 - cmdstanpy - INFO - Chain [1] start processing 01:16:59 - cmdstanpy - INFO - Chain [1] done processing 01:19:06 - cmdstanpy - INFO - Chain [1] start processing 01:20:01 - cmdstanpy - INFO - Chain [1] done processing 01:21:36 - cmdstanpy - INFO - Chain [1] start processing 01:22:38 - cmdstanpy - INFO - Chain [1] done processing 01:24:46 - cmdstanpy - INFO - Chain [1] start processing 01:25:27 - cmdstanpy - INFO - Chain [1] done processing
changepoint_prior_scale seasonality_prior_scale rmse 0 0.001 0.01 0.005033 1 0.001 0.10 0.005039 2 0.001 1.00 0.005033 3 0.001 10.00 0.005039 4 0.010 0.01 0.005083 5 0.010 0.10 0.005095 6 0.010 1.00 0.005089 7 0.010 10.00 0.005088 8 0.100 0.01 0.005052 9 0.100 0.10 0.005050 10 0.100 1.00 0.005042 11 0.100 10.00 0.005037 12 0.500 0.01 0.005049 13 0.500 0.10 0.005055 14 0.500 1.00 0.005044 15 0.500 10.00 0.005062
best_params = all_params[np.argmin(rmses)]
print(best_params)
{'changepoint_prior_scale': 0.001, 'seasonality_prior_scale': 1.0}
# retrain model with the best parameters
model = Prophet(changepoint_prior_scale=0.001, seasonality_prior_scale=1.0)
# add seasonality to the model with the default setting of seasonality_mode="additive"
model.add_seasonality(name="monthly", period=30.5, fourier_order=5)
# fit the model
model.fit(X_train)
01:27:13 - cmdstanpy - INFO - Chain [1] start processing 01:27:25 - cmdstanpy - INFO - Chain [1] done processing
<prophet.forecaster.Prophet at 0x19d82bf69a0>
# create a future dataframe that will extend to length of X_test
future = model.make_future_dataframe(periods=len(X_test))
future.tail()
| ds | |
|---|---|
| 116055 | 2041-08-11 23:30:00 |
| 116056 | 2041-08-12 23:30:00 |
| 116057 | 2041-08-13 23:30:00 |
| 116058 | 2041-08-14 23:30:00 |
| 116059 | 2041-08-15 23:30:00 |
# make forecast from the predict function
forecast = model.predict(future)
# subset the columns with the predictions and print the tail of the dataframe
forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail()
| ds | yhat | yhat_lower | yhat_upper | |
|---|---|---|---|---|
| 116055 | 2041-08-11 23:30:00 | 0.004958 | -0.001383 | 0.011622 |
| 116056 | 2041-08-12 23:30:00 | 0.004943 | -0.001741 | 0.011651 |
| 116057 | 2041-08-13 23:30:00 | 0.004963 | -0.002296 | 0.011760 |
| 116058 | 2041-08-14 23:30:00 | 0.004993 | -0.001459 | 0.010863 |
| 116059 | 2041-08-15 23:30:00 | 0.005045 | -0.001301 | 0.011515 |
# plot the forecast and set the figsize, formatting, and legend of the graph
fig1 = model.plot(forecast, figsize=(7,5))
ax = fig1.gca()
ax.legend(loc=2, prop={"size":10})
# save fig to png file
plt.savefig("plots/prop_observed_forecast.png")
plt.show()
# graph the plot_components of the forecast
fig2 = model.plot_components(forecast)
plt.savefig("plots/prop_trends.png")
plt.show()
# get the predictions for forecast_test by slicing the rows that matches length of X_test
forecast_test = forecast[-len(X_test):]
# get the metric scores for the test set - MAPE RMSE, MAE, RMSE
sMAPE = 100/len(X_test["y"]) * np.sum(2 * np.abs(forecast_test["yhat"].values - X_test["y"].values) / (np.abs(X_test["y"].values) + np.abs(forecast_test["yhat"].values)))
MAPE = np.mean(np.abs((X_test["y"].values - forecast_test["yhat"].values) / X_test["y"].values)) * 100
MAE = mean_absolute_error(X_test["y"], forecast_test["yhat"])
RMSE = np.sqrt(mean_squared_error(X_test["y"], forecast_test["yhat"]))
print("Metric Scores for Prophet Model")
print("sMAPE: ", round(sMAPE,2),"%")
print("MAPE: ", round(MAPE,2),"%")
print("MAE: ", round(MAE,2))
print("RMSE: ", round(RMSE,2))
Metric Scores for Prophet Model sMAPE: 100.85 % MAPE: 292.28 % MAE: 0.0 RMSE: 0.0
# load data
ufo_data = ufo_reset_df.copy()
ufo_data.dtypes
# strip the spaces from the column names
ufo_data.columns = ufo_data.columns.str.strip()
# encode categorical features
le = LabelEncoder()
# encode the location features
le.fit(ufo_data["city"])
encoded_city = le.transform(ufo_data["city"])
decoded_city = le.inverse_transform(encoded_city) # For inversing the LabelEncoder later
ufo_data["city"] = le.transform(ufo_data["city"])
le.fit(ufo_data["state"])
encoded_state = le.transform(ufo_data["state"])
decoded_state = le.inverse_transform(encoded_state) # For inversing the LabelEncoder
ufo_data["state"] = le.transform(ufo_data["state"])
le.fit(ufo_data["country"])
encoded_ctry = le.transform(ufo_data["country"])
decoded_ctry = le.inverse_transform(encoded_ctry) # For inversing the LabelEncoder
ufo_data["country"] = le.transform(ufo_data["country"])
# encode the shape feature (in case we use shape as a feature)
le.fit(ufo_data["shape"])
encoded_sh = le.transform(ufo_data["shape"])
decoded_sh = le.inverse_transform(encoded_sh) # For inversing the LabelEncoder
ufo_data["shape"] = le.transform(ufo_data["shape"])
# get the year and month
ufo_data["year"] = pd.DatetimeIndex(ufo_data.index).year
ufo_data["month"] = pd.DatetimeIndex(ufo_data.index).month
# for duration, it has been normalized already as ufo_reset_df["duration_normalized"]
print("ufo_data[duration_norm]: ", type(ufo_data["duration_norm"].dtypes))
ufo_data[duration_norm]: <class 'numpy.dtype[float64]'>
# Inversing the LabelEncoder(): display the encoded and decoded location values
print("encoded city = ", encoded_city)
print("decoded city = ", decoded_city)
print("\nencoded state = ", encoded_state)
print("decoded state = ", decoded_state)
print("\nencoded country = ", encoded_ctry)
print("decoded country = ", decoded_ctry)
print("\nencoded shape = ", encoded_sh)
print("decoded shape = ", decoded_sh)
encoded city = [11002 3330 13169 ... 9076 13180 144] decoded city = ['Lulworth' 'Cherokee' 'New York' ... 'Johannesburg' 'Newark' 'Ahmedabad'] encoded state = [211 119 136 ... 68 58 209] decoded state = [None 'NC' 'NY' ... 'GAUTENG' 'DE' 'nan'] encoded country = [325 301 301 ... 254 301 308] decoded country = [None 'USA' 'USA' ... 'South Africa' 'USA' 'Unknown'] encoded shape = [23 11 5 ... 21 14 22] decoded shape = [nan 'fireball' 'cross' ... 'triangle' 'light' 'unknown']
# subset X into feature and the freq columns
X = ufo_data[["year","city","country","freq"]]
# drop na values
X = X.dropna()
# split data into X_train set, having years < 2021, and X_test set, having year >= 2021
X_train = X.loc[X["year"] < 2021]
X_test = X.loc[X["year"] >= 2021]
# split data into y_train and y_test from X_train and X_test
y = X[["freq"]]
y_train = X_train[["freq"]]
y_test = X_test[["freq"]]
# copy y_test to actual for analysis later
actual = y_test.copy()
# drop extra frequency column
X = X.drop(columns=["freq"],axis=1)
X_train = X_train.drop(columns=["freq"],axis=1)
X_test = X_test.drop(columns=["freq"],axis=1)
# perform a GridSearchCV to get the best parameter values to build the model
# instantiate LinearRegression
model = LinearRegression()
# assign parameters and settings in param_grid
param_grid = {'fit_intercept': [True, False], 'n_jobs': [None,-1,2,4], 'positive': [True, False], }
# assign scoring to accuracy_score
scoring = make_scorer(accuracy_score)
# instantiate the grid search with the model and parameter grid settings
grid = GridSearchCV(model, param_grid, scoring=scoring, cv=5)
# fit grid search on data
grid.fit(X_train, y_train)
# print best parameters and best score
print("Best parameter values: ", grid.best_params_)
print("Best score: ", grid.best_score_)
Best parameter values: {'fit_intercept': True, 'n_jobs': None, 'positive': True}
Best score: nan
# train a linear regression model
modelLR = LinearRegression(fit_intercept=True, positive=True)
modelLR.fit(X_train, y_train)
# make predictions on the testing data and print predictions and actual values
y_pred = modelLR.predict(X_test)
lr_pred = y_pred # for model analysis later
print("Predicted frequencies: ", y_pred)
print("Actual frequencies: ", y_test.values.tolist()[:3]," ... ", y_test.values.tolist()[-3:])
# evaluate the model performance using mse, mae, and R2
mseLR = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mseLR:.2f}")
maeLR = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error: {maeLR:.2f}")
r2LR = r2_score(y_test, y_pred)
print(f"R2 score: {r2LR:.2f}")
Predicted frequencies: [[0.00499018] [0.00415902] [0.00480278] ... [0.00414911] [0.00471997] [0.00401072]] Actual frequencies: [[0.0014319057233271762], [0.0014319057233271762], [0.0014319057233271762]] ... [[0.002147858584990764], [0.0028638114466543524], [0.0007159528616635881]] Mean squared error: 0.00 Mean absolute error: 0.00 R2 score: -0.28
# plot the actual vs predicted values for the model
plt.figure(figsize=(6, 5))
# plot the histogram of actual and predicted values
plt.hist(y_test, label="actual", bins=20)
plt.hist(y_pred, label="predicted", bins=20)
# create an array of x-coordinates for the line graph
x = np.linspace(min(y_pred), max(y_pred), len(y_pred))
# plot the predicted values as a line graph
plt.plot(x, y_pred, color="r", marker=".", linestyle=":", label="predicted line")
# set title and formatting
plt.title("Actual vs Predicted Frequency of UFO Sightings")
plt.legend(loc="upper right")
plt.xlabel("Frequency")
plt.ylabel("Count")
plt.savefig("plots/lr_freq_plot.png")
plt.show()
# perform a LassoCV cross validation to get the best parameter values to build the model
# create and fit a LassoCV model with 5-fold cross-validation
reg = LassoCV(cv=5, random_state=42).fit(X, y)
# display alpha to use and score
print("Best alpha parameter value: ", reg.alphas)
print("Training score: ", reg.score(X, y))
# predict on actual data and evaluate the performance
y_pred = reg.predict(X_test)
print("MSE Score: ", mean_squared_error(y_test, y_pred))
Best alpha parameter value: None Training score: 0.01841182966491206 MSE Score: 1.3608267669304836e-05
# build a Lasso regressor model
modelLS = Lasso(random_state=42) # Use Lasso class with no alpha parameter, default is alpha=1.0
# fit the model to the data
modelLS.fit(X_train, y_train)
# make predictions on the testing data and print predictions and actual values
y_pred = modelLS.predict(X_test)
ls_pred = y_pred # for analysis later
print("Predicted frequencies: ", y_pred)
print("Actual frequencies: ", y_test.values.tolist()[:3]," ... ", y_test.values.tolist()[-3:])
# evaluate the model performance using mse, mae, and R2
mseLS = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mseLS:.2f}")
maeLS = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error: {maeLS:.2f}")
r2LS = r2_score(y_test, y_pred)
print(f"R2 score: {r2LS:.2f}")
Predicted frequencies: [0.00373597 0.00330509 0.00363881 ... 0.00343745 0.00356127 0.00316796] Actual frequencies: [[0.0014319057233271762], [0.0014319057233271762], [0.0014319057233271762]] ... [[0.002147858584990764], [0.0028638114466543524], [0.0007159528616635881]] Mean squared error: 0.00 Mean absolute error: 0.00 R2 score: -0.05
# plot the actual vs predicted values for the model
plt.figure(figsize=(6, 5))
# plot the histogram of actual and predicted values
plt.hist(y_test, label="actual", bins=20)
plt.hist(y_pred, label="predicted", bins=20)
# create an array of x-coordinates for the line graph
x = np.linspace(min(y_pred), max(y_pred), len(y_pred))
# plot the predicted values as a line graph
plt.plot(x, y_pred, color="r", marker=".", linestyle=":", label="predicted line")
# set title and formatting
plt.title("Actual vs Predicted Frequency of UFO Sightings")
plt.legend(loc="upper right")
plt.xlabel("Frequency")
plt.ylabel("Count")
plt.savefig("plots/ls_freq_plot.png")
plt.show()
# perform a GridSearchCV to get the best parameter values to build the model
### WARNING: This block takes a long time to run #########
# define parameter grid to test out parameter values
param_grid = {
# criterion used in previous GridSearch: 'friedman_mse', 'squared_error', 'poisson'
"criterion": ['squared_error', 'poisson', 'absolute_error'],
# max_depth used in previous GridSearch: : 2, 3, 4, 5
"max_depth": [4, 5, 6],
"min_samples_split": [2,3]
}
# create a GridSearchCV object with DecisionTreeRegressor and the parameters grid settings
grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, scoring="neg_mean_squared_error")
# fit the grid search on the training data
grid_search.fit(X_train, y_train)
# print out best parameters and the best score
print("Best parameters:", grid_search.best_params_)
print("Best score:", -grid_search.best_score_)
# predict values for the testing data using the best estimator
y_pred = grid_search.best_estimator_.predict(X_test)
# evaluate the performance
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)
Best parameters: {'criterion': 'poisson', 'max_depth': 6, 'min_samples_split': 2}
Best score: 2.606312785319421e-05
Mean squared error: 1.4590401233272634e-05
# create and fit the DecisionTreeRegressor model
modelDT = DecisionTreeRegressor(max_depth=6, criterion="poisson", min_samples_split=3, random_state=42)
modelDT.fit(X_train, y_train)
# make predictions on the testing data and print predictions and actual values
y_pred = modelDT.predict(X_test)
dt_pred = y_pred # for analysis later
print("Predicted frequencies: ", y_pred)
print("Actual frequencies: ", y_test.values.tolist()[:3]," ... ", y_test.values.tolist()[-3:])
# evaluate the model performance using mse, mae, and R2
mseDT = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mseDT:.2f}")
maeDT = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error: {maeDT:.2f}")
r2DT = r2_score(y_test, y_pred)
print(f"R2 score: {r2DT:.2f}")
Predicted frequencies: [0.0052176 0.00318307 0.0052176 ... 0.00095896 0.0052176 0.00165912] Actual frequencies: [[0.0014319057233271762], [0.0014319057233271762], [0.0014319057233271762]] ... [[0.002147858584990764], [0.0028638114466543524], [0.0007159528616635881]] Mean squared error: 0.00 Mean absolute error: 0.00 R2 score: -0.18
# plot the actual vs predicted values for the model
plt.figure(figsize=(6, 5))
# plot the histogram of actual and predicted values
plt.hist(y_test, label="actual", bins=20)
plt.hist(y_pred, label="predicted", bins=20)
# create an array of x-coordinates for the line graph
x = np.linspace(min(y_pred), max(y_pred), len(y_pred))
# plot the predicted values as a line graph
plt.plot(x, y_pred, color="r", marker=".", linestyle=":", label="predicted line")
# set title and formatting
plt.title("Actual vs Predicted Frequency of UFO Sightings")
plt.legend(loc="upper right")
plt.xlabel("Frequency")
plt.ylabel("Count")
plt.savefig("plots/dt_freq_plot.png")
plt.show()
# perform a GridSearchCV to get the best parameter values to build the model
### WARNING: This block takes a long time to run #########
# define parameter grid to test out parameter values
param_grid = {
"n_estimators": [200,250], # tried 300,400,500,600 in previous runs
"max_features": ["sqrt", "log2"],
"min_samples_split": [2,4,3]
}
# create a GridSearchCV object with RandomForestRegressor and the parameter grid
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring="neg_mean_squared_error")
# fit the grid search on the training data
grid_search.fit(X_train, y_train)
# print the best parameters and the best score
print("Best parameters: ", grid_search.best_params_)
print("Best score: ", -grid_search.best_score_)
# make predictions on the testing data using the best estimator
y_pred = grid_search.best_estimator_.predict(X_test)
# evaluate the performance
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)
Best parameters: {'max_features': 'sqrt', 'min_samples_split': 4, 'n_estimators': 250}
Best score: 1.4481423148168945e-05
Mean squared error: 1.544495220214413e-05
# create a random forest regressor model
modelRF = RandomForestRegressor(n_estimators=250, max_features="sqrt", min_samples_split=4, random_state=42)
# fit the model to the train data
modelRF.fit(X_train, y_train)
# make predictions and print out results
y_pred = modelRF.predict(X_test)
rf_pred = y_pred # for analysis later
print("Predicted frequencies: ", y_pred)
print("Actual frequencies: ", y_test.values.tolist()[:3]," ... ", y_test.values.tolist()[-3:])
# evaluate the model performance using mse, mae, and R2
mseRF = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mseRF:.2f}")
maeRF = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error: {maeRF:.2f}")
r2RF = r2_score(y_test, y_pred)
print(f"R2 score: {r2RF:.2f}")
Predicted frequencies: [0.00424877 0.00214388 0.00091441 ... 0.00096076 0.00283202 0.00082761] Actual frequencies: [[0.0014319057233271762], [0.0014319057233271762], [0.0014319057233271762]] ... [[0.002147858584990764], [0.0028638114466543524], [0.0007159528616635881]] Mean squared error: 0.00 Mean absolute error: 0.00 R2 score: -0.25
# plot the actual vs predicted values for the model
plt.figure(figsize=(6, 5))
# plot the histogram of actual and predicted values
plt.hist(y_test, label="actual", bins=20)
plt.hist(y_pred, label="predicted", bins=20)
# create an array of x-coordinates for the line graph
x = np.linspace(min(y_pred), max(y_pred), len(y_pred))
# plot the predicted values as a line graph
plt.plot(x, y_pred, color="r", marker=".", linestyle=":", label="predicted line")
# set title and formatting
plt.title("Actual vs Predicted Frequency of UFO Sightings")
plt.legend(loc="upper right")
plt.xlabel("Frequency")
plt.ylabel("Count")
plt.savefig("plots/rf_freq_plot.png")
plt.show()
# use a GridSearchCV to find the best parameters for HistGradientBoostingRegressor
#### WARNING: this block take a bit long to run (3 - 4 min)
from sklearn.ensemble import HistGradientBoostingRegressor
# choose the parameters for the selector via a dictionary
parameters = {
"loss":["squared_error","absolute_error","gamma","poisson"],
"learning_rate":[0.005,0.1,0.5,0.8],
"max_iter":[100,125,150],
"max_leaf_nodes":[31,35,40],
}
# create an instance of GridSearchCV with GradientBoostingRegressor as estimator
grid = GridSearchCV(HistGradientBoostingRegressor(),parameters,scoring="r2",cv=5,n_jobs=-1)
# fit the grid search and display the best parameters' scores
grid.fit(X_train,y_train)
print(grid.best_params_, grid.best_score_)
# make predictions on the testing data using the best estimator
#y_pred = grid_search.best_estimator_.predict(X_test)
y_pred = grid.best_estimator_.predict(X_test)
# evaluate the performance
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)
{'learning_rate': 0.1, 'loss': 'poisson', 'max_iter': 150, 'max_leaf_nodes': 31} 0.39654437569507317
Mean squared error: 1.4028567552982727e-05
# create HistGradientBoostingRegressor model
# create an instance of the HGBRegressor
modelhgbr = HistGradientBoostingRegressor(loss="poisson", learning_rate=0.1, max_iter=150, max_leaf_nodes=35)
# fit the classifier on the dataset
modelhgbr.fit(X_train, y_train)
# predict the frequency of UFO sightings for the data
y_pred = modelhgbr.predict(X_test)
hgbr_pred = y_pred # for analysis later
print("Predicted frequencies: ", y_pred)
print("Actual frequencies: ", y_test.values.tolist()[:3]," ... ", y_test.values.tolist()[-3:])
# evaluate the model performance using mse, mae, and R2
mseHGBR = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mseHGBR:.2f}")
maeHGBR = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error: {maeHGBR:.2f}")
r2HGBR = r2_score(y_test, y_pred)
print(f"R2 score: {r2HGBR:.2f}")
Predicted frequencies: [0.00309374 0.00939766 0.00268873 ... 0.00097707 0.00657849 0.00101086] Actual frequencies: [[0.0014319057233271762], [0.0014319057233271762], [0.0014319057233271762]] ... [[0.002147858584990764], [0.0028638114466543524], [0.0007159528616635881]] Mean squared error: 0.00 Mean absolute error: 0.00 R2 score: -0.17
# plot the actual vs predicted values for the model
plt.figure(figsize=(6, 5))
# plot the histogram of actual and predicted values
plt.hist(y_test, label="actual", bins=20)
plt.hist(y_pred, label="predicted", bins=20)
# create an array of x-coordinates for the line graph
x = np.linspace(min(y_pred), max(y_pred), len(y_pred))
# plot the predicted values as a line graph
plt.plot(x, y_pred, color="r", marker=".", linestyle=":", label="predicted line")
# set title and formatting
plt.title("Actual vs Predicted Frequency of UFO Sightings")
plt.legend(loc="upper right")
plt.xlabel("Frequency")
plt.ylabel("Count")
plt.savefig("plots/hgbr_freq_plot.png")
plt.show()
actual
| freq | |
|---|---|
| date_time | |
| 2021-01-01T00:00:00 | 0.001432 |
| 2021-01-01T00:00:00 | 0.001432 |
| 2021-01-01T00:00:00 | 0.001432 |
| 2021-01-01T00:00:00 | 0.010739 |
| 2021-01-01T00:00:00 | 0.003580 |
| ... | ... |
| 2022-12-21T07:55:00 | 0.000716 |
| 2022-12-21T21:33:00 | 0.014319 |
| 2022-12-22T00:00:00 | 0.002148 |
| 2022-12-22T09:29:00 | 0.002864 |
| 2022-12-22T10:50:00 | 0.000716 |
7532 rows × 1 columns
# create a list of regression models to compare
models = [LinearRegression(fit_intercept=True, positive=True),
Lasso(random_state=42),
DecisionTreeRegressor(max_depth=6, criterion="poisson", min_samples_split=2, random_state=42),
RandomForestRegressor(n_estimators=250, max_features="sqrt", min_samples_split=4, random_state=42),
HistGradientBoostingRegressor(loss="poisson", learning_rate=0.1, max_iter=150, max_leaf_nodes=35)
]
# define the cross-validation method to use
cv = KFold(n_splits=5, random_state=42, shuffle=True)
# define the scoring metrics to use
scoring = ["r2", "neg_mean_squared_error", "neg_mean_absolute_error"]
# make a dictionary to hold metric data
metric_data = {"model": ["LinearRegression", "Lasso", "DecisionTreeReg", "RandForestReg", "HGBReg"],
"neg_mean_mse": [None, None, None, None, None],
"neg_mean_mae": [None, None, None, None, None],
"R2_mean": [None, None, None, None, None]}
# make a metric dataframe from metric_data
metric_df = pd.DataFrame(metric_data)
# loop through the models and compare their performance
for i, model in enumerate(models):
# use cross_validate function to evaluate the model
scores = cross_validate(model, X, y, scoring=scoring, cv=cv, n_jobs=-1)
# store the metrics in the metric_df columns
metric_df.loc[i, "neg_mean_mse"] = -round(scores["test_neg_mean_squared_error"].mean(), 2)
metric_df.loc[i, "neg_mean_mae"] = -round(scores["test_neg_mean_absolute_error"].mean(), 2)
metric_df.loc[i, "R2_mean"] = round(scores["test_r2"].mean(), 2)
# print the model name and the mean test scores for each metric
print(model.__class__.__name__, "\nneg_mean_squared_error: ", -scores["test_neg_mean_squared_error"].mean(), "\nneg_mean_absolute_error: ", -scores["test_neg_mean_absolute_error"].mean(), "\nr2: ", scores["test_r2"].mean())
LinearRegression neg_mean_squared_error: 2.729417123189244e-05 neg_mean_absolute_error: 0.0031242153150519572 r2: 0.026416819493445053 Lasso neg_mean_squared_error: 2.7954094324844322e-05 neg_mean_absolute_error: 0.003212406159237985 r2: 0.0028642867460269227 DecisionTreeRegressor neg_mean_squared_error: 2.4204183947540445e-05 neg_mean_absolute_error: 0.002899444277866533 r2: 0.13659597578574648 RandomForestRegressor neg_mean_squared_error: 1.059737238234838e-06 neg_mean_absolute_error: 0.00042554435175590194 r2: 0.9622090008690005 HistGradientBoostingRegressor neg_mean_squared_error: 9.259131650152176e-06 neg_mean_absolute_error: 0.0019218633111557075 r2: 0.6696208682734115
metric_df.to_csv("data/reg_models_metrics.csv")
print("Regressor Model Metric Table")
print(metric_df)
Regressor Model Metric Table
model neg_mean_mse neg_mean_mae R2_mean
0 LinearRegression 0.0 0.0 0.03
1 Lasso 0.0 0.0 0.0
2 DecisionTreeReg 0.0 0.0 0.14
3 RandForestReg 0.0 0.0 0.96
4 HGBReg 0.0 0.0 0.67
# create a dataframe to display predictions of each model
# make data as a dictionary of each model's prediction values
data = {"lr_pred": lr_pred.ravel(), "ls_pred": ls_pred.ravel(), "dt_pred": dt_pred.ravel(), "rf_pred": rf_pred.ravel(), "hgbr_pred": hgbr_pred.ravel(), "actual": actual.to_numpy().ravel()}
# make the column names the names of the models
columns = ["lr_pred", "ls_pred", "dt_pred", "rf_pred", "hgbr_pred", "actual"]
preds_df = pd.DataFrame(data=data, columns=columns)
print("\tRegressor Model Predictions vs Actual Table")
preds_df
Regressor Model Predictions vs Actual Table
| lr_pred | ls_pred | dt_pred | rf_pred | hgbr_pred | actual | |
|---|---|---|---|---|---|---|
| 0 | 0.004990 | 0.003736 | 0.005218 | 0.004249 | 0.003094 | 0.001432 |
| 1 | 0.004159 | 0.003305 | 0.003183 | 0.002144 | 0.009398 | 0.001432 |
| 2 | 0.004803 | 0.003639 | 0.005218 | 0.000914 | 0.002689 | 0.001432 |
| 3 | 0.004749 | 0.003611 | 0.005218 | 0.022910 | 0.018813 | 0.010739 |
| 4 | 0.004824 | 0.003650 | 0.005218 | 0.002790 | 0.003444 | 0.003580 |
| ... | ... | ... | ... | ... | ... | ... |
| 7527 | 0.002424 | 0.003338 | 0.001851 | 0.000735 | 0.001753 | 0.000716 |
| 7528 | 0.004178 | 0.003280 | 0.004132 | 0.007875 | 0.006427 | 0.014319 |
| 7529 | 0.004149 | 0.003437 | 0.000959 | 0.000961 | 0.000977 | 0.002148 |
| 7530 | 0.004720 | 0.003561 | 0.005218 | 0.002832 | 0.006578 | 0.002864 |
| 7531 | 0.004011 | 0.003168 | 0.001659 | 0.000828 | 0.001011 | 0.000716 |
7532 rows × 6 columns
# subset the ufo_reset_df dataframe to features that will help in interpreting predictions
preds_cities = ufo_reset_df[["date_time","year","city","state","country","city_latitude","city_longitude","freq"]]
# reset the index numbering of preds_cities
preds_cities = preds_cities.reset_index(drop=True)
# filter out years that are not in the target set
preds_cities = preds_cities.loc[preds_cities["year"] > 2020]
preds_cities
| date_time | year | city | state | country | city_latitude | city_longitude | freq | |
|---|---|---|---|---|---|---|---|---|
| 131883 | 2021-01-01 00:00:00 | 2021 | Tempe | AZ | USA | 33.397445 | -111.924685 | 0.001432 |
| 131884 | 2021-01-01 00:00:00 | 2021 | Deltona | FL | USA | 28.904056 | -81.238074 | 0.001432 |
| 131885 | 2021-01-01 00:00:00 | 2021 | Rapid City | SD | USA | 44.072001 | -103.255555 | 0.001432 |
| 131886 | 2021-01-01 00:00:00 | 2021 | Phoenix | AZ | USA | 33.503786 | -112.077080 | 0.010739 |
| 131887 | 2021-01-01 00:00:00 | 2021 | Riverview | FL | USA | 27.848577 | -82.321746 | 0.003580 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 139669 | 2022-12-21 21:33:00 | 2022 | Columbus | OH | USA | 39.993828 | -82.982484 | 0.014319 |
| 139670 | 2022-12-22 00:00:00 | 2022 | NaN | NC | USA | NaN | NaN | NaN |
| 139671 | 2022-12-22 00:00:00 | 2022 | Johannesburg | GAUTENG | South Africa | NaN | NaN | 0.002148 |
| 139672 | 2022-12-22 09:29:00 | 2022 | Newark | DE | USA | 39.670280 | -75.720711 | 0.002864 |
| 139673 | 2022-12-22 10:50:00 | 2022 | Ahmedabad | nan | Unknown | NaN | NaN | 0.000716 |
7791 rows × 8 columns
# filter out the missing values from city and state columns with a mask
# set mask to cities that are NA
mask = preds_cities["city"].isna()
# assign preds_cities to the cities that are not NA with inverse operator
preds_cities = preds_cities[~mask]
# set mask to states that are NA
mask = preds_cities["state"].isna()
# assign preds_cities to the states that are not NA with inverse operator
preds_cities = preds_cities[~mask]
preds_cities
| date_time | year | city | state | country | city_latitude | city_longitude | freq | |
|---|---|---|---|---|---|---|---|---|
| 131883 | 2021-01-01 00:00:00 | 2021 | Tempe | AZ | USA | 33.397445 | -111.924685 | 0.001432 |
| 131884 | 2021-01-01 00:00:00 | 2021 | Deltona | FL | USA | 28.904056 | -81.238074 | 0.001432 |
| 131885 | 2021-01-01 00:00:00 | 2021 | Rapid City | SD | USA | 44.072001 | -103.255555 | 0.001432 |
| 131886 | 2021-01-01 00:00:00 | 2021 | Phoenix | AZ | USA | 33.503786 | -112.077080 | 0.010739 |
| 131887 | 2021-01-01 00:00:00 | 2021 | Riverview | FL | USA | 27.848577 | -82.321746 | 0.003580 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 139667 | 2022-12-21 07:55:00 | 2022 | Essex | ON | Canada | 42.173600 | -82.829900 | 0.000716 |
| 139669 | 2022-12-21 21:33:00 | 2022 | Columbus | OH | USA | 39.993828 | -82.982484 | 0.014319 |
| 139671 | 2022-12-22 00:00:00 | 2022 | Johannesburg | GAUTENG | South Africa | NaN | NaN | 0.002148 |
| 139672 | 2022-12-22 09:29:00 | 2022 | Newark | DE | USA | 39.670280 | -75.720711 | 0.002864 |
| 139673 | 2022-12-22 10:50:00 | 2022 | Ahmedabad | nan | Unknown | NaN | NaN | 0.000716 |
7532 rows × 8 columns
# assign important features or columns of preds_df to preds_cities
preds_cities["actual"] = preds_df["actual"].values
preds_cities["rf_pred"] = preds_df["rf_pred"].values
preds_cities
| date_time | year | city | state | country | city_latitude | city_longitude | freq | actual | rf_pred | |
|---|---|---|---|---|---|---|---|---|---|---|
| 131883 | 2021-01-01 00:00:00 | 2021 | Tempe | AZ | USA | 33.397445 | -111.924685 | 0.001432 | 0.001432 | 0.004249 |
| 131884 | 2021-01-01 00:00:00 | 2021 | Deltona | FL | USA | 28.904056 | -81.238074 | 0.001432 | 0.001432 | 0.002144 |
| 131885 | 2021-01-01 00:00:00 | 2021 | Rapid City | SD | USA | 44.072001 | -103.255555 | 0.001432 | 0.001432 | 0.000914 |
| 131886 | 2021-01-01 00:00:00 | 2021 | Phoenix | AZ | USA | 33.503786 | -112.077080 | 0.010739 | 0.010739 | 0.022910 |
| 131887 | 2021-01-01 00:00:00 | 2021 | Riverview | FL | USA | 27.848577 | -82.321746 | 0.003580 | 0.003580 | 0.002790 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 139667 | 2022-12-21 07:55:00 | 2022 | Essex | ON | Canada | 42.173600 | -82.829900 | 0.000716 | 0.000716 | 0.000735 |
| 139669 | 2022-12-21 21:33:00 | 2022 | Columbus | OH | USA | 39.993828 | -82.982484 | 0.014319 | 0.014319 | 0.007875 |
| 139671 | 2022-12-22 00:00:00 | 2022 | Johannesburg | GAUTENG | South Africa | NaN | NaN | 0.002148 | 0.002148 | 0.000961 |
| 139672 | 2022-12-22 09:29:00 | 2022 | Newark | DE | USA | 39.670280 | -75.720711 | 0.002864 | 0.002864 | 0.002832 |
| 139673 | 2022-12-22 10:50:00 | 2022 | Ahmedabad | nan | Unknown | NaN | NaN | 0.000716 | 0.000716 | 0.000828 |
7532 rows × 10 columns
# verify to check if freq values from preds_cities["freq"] is equal to preds_cities["actual"]
flag = 0
# if any of the freq values are not equal, flag would be 1
if not (preds_cities["freq"] == preds_cities["actual"]).any():
flag += 1
print(flag)
0
# drop freq column that is redundant
preds_cities.drop(columns=["freq"], inplace=True)
# use preds_cities for mapping later
preds_cities.tail(10)
| date_time | year | city | state | country | city_latitude | city_longitude | actual | rf_pred | |
|---|---|---|---|---|---|---|---|---|---|
| 139662 | 2022-12-20 02:00:00 | 2022 | Tucson | AZ | USA | 32.250179 | -110.939324 | 0.012171 | 0.017183 |
| 139663 | 2022-12-20 02:30:00 | 2022 | Washington Dc | WA | USA | NaN | NaN | 0.002864 | 0.002410 |
| 139664 | 2022-12-20 21:58:00 | 2022 | Mentor Headlands | OH | USA | NaN | NaN | 0.000716 | 0.001460 |
| 139665 | 2022-12-20 23:30:00 | 2022 | East Greenwich | RI | USA | 41.652800 | -71.469700 | 0.001432 | 0.002815 |
| 139666 | 2022-12-21 04:11:00 | 2022 | Franklin | ME | USA | 44.611700 | -68.240300 | 0.007875 | 0.005728 |
| 139667 | 2022-12-21 07:55:00 | 2022 | Essex | ON | Canada | 42.173600 | -82.829900 | 0.000716 | 0.000735 |
| 139669 | 2022-12-21 21:33:00 | 2022 | Columbus | OH | USA | 39.993828 | -82.982484 | 0.014319 | 0.007875 |
| 139671 | 2022-12-22 00:00:00 | 2022 | Johannesburg | GAUTENG | South Africa | NaN | NaN | 0.002148 | 0.000961 |
| 139672 | 2022-12-22 09:29:00 | 2022 | Newark | DE | USA | 39.670280 | -75.720711 | 0.002864 | 0.002832 |
| 139673 | 2022-12-22 10:50:00 | 2022 | Ahmedabad | nan | Unknown | NaN | NaN | 0.000716 | 0.000828 |
preds_display = preds_cities[["date_time","year","city","state","country","actual","rf_pred"]]
print("Actual vs. Predicted Frequencies of a UFO Sighting (using RandomForestRegressor Preds)")
preds_display.tail(10)
Actual vs. Predicted Frequencies of a UFO Sighting (using RandomForestRegressor Preds)
| date_time | year | city | state | country | actual | rf_pred | |
|---|---|---|---|---|---|---|---|
| 139662 | 2022-12-20 02:00:00 | 2022 | Tucson | AZ | USA | 0.012171 | 0.017183 |
| 139663 | 2022-12-20 02:30:00 | 2022 | Washington Dc | WA | USA | 0.002864 | 0.002410 |
| 139664 | 2022-12-20 21:58:00 | 2022 | Mentor Headlands | OH | USA | 0.000716 | 0.001460 |
| 139665 | 2022-12-20 23:30:00 | 2022 | East Greenwich | RI | USA | 0.001432 | 0.002815 |
| 139666 | 2022-12-21 04:11:00 | 2022 | Franklin | ME | USA | 0.007875 | 0.005728 |
| 139667 | 2022-12-21 07:55:00 | 2022 | Essex | ON | Canada | 0.000716 | 0.000735 |
| 139669 | 2022-12-21 21:33:00 | 2022 | Columbus | OH | USA | 0.014319 | 0.007875 |
| 139671 | 2022-12-22 00:00:00 | 2022 | Johannesburg | GAUTENG | South Africa | 0.002148 | 0.000961 |
| 139672 | 2022-12-22 09:29:00 | 2022 | Newark | DE | USA | 0.002864 | 0.002832 |
| 139673 | 2022-12-22 10:50:00 | 2022 | Ahmedabad | nan | Unknown | 0.000716 | 0.000828 |
# save preds_cities dataframe as a csv file
preds_display.to_csv("data/preds_display.csv")
# apply a color gradient to the preds_cities dataframe
grad_style_df = preds_cities.style.background_gradient(cmap="Blues", axis=None)
# save the preds_cities dataframe as an HTML file
with open("data/styled_pred_df.html", "w") as file:
file.write(grad_style_df.render())
# extract the year from the date_time column
compare_df = ufo_df.copy() # Need to make copy of ufo_df of possibility of altering the working dataframe
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
compare_df2 = ufo_df.copy()
compare_df2["year"] = pd.DatetimeIndex(compare_df2["date_time"]).year
# create a list of years for comparison
yr_list = [1762, 1860, 1861, 1864, 1865, 1925, 1928, 1929, 1930, 1931, 1932, 1936, 1940, 1945, 1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
recent_yr_list = [1960,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022]
# filter the dataframe by the given years
compare_df_allyrs = compare_df.copy() # To make distribution graphs later
compare_df = compare_df.loc[compare_df["year"].isin(yr_list)]
compare_df2 = compare_df2.loc[compare_df2["year"].isin(recent_yr_list)]
# group by year and count sightings
all_yrs = compare_df_allyrs.groupby("year").size()
yrs_comp = compare_df.groupby("year").size()
recent_yrs_comp = compare_df2.groupby("year").size()
# reset index to count
all_yrs = all_yrs.reset_index(name="count")
yrs_comp = yrs_comp.reset_index(name="count")
recent_yrs_comp = recent_yrs_comp.reset_index(name="count")
print("Year and Counts of UFO Sightings (hoax filtered)\n", yrs_comp.head(10).astype(int).to_string(index=True))
print("\nYear and Counts of UFO Sightings (no hoax filter)\n", yrs_comp_no_fil.head(10).astype(int).to_string(index=True))
print("\nRecent Year and Counts of UFO Sightings (hoax filtered)\n", recent_yrs_comp.tail(10).astype(int).to_string(index=True))
print("\nRecent Year and Counts of UFO Sightings (no hoax filter)\n", recent_yrs_comp_no_fil.tail(10).astype(int).to_string(index=True))
Year and Counts of UFO Sightings (hoax filtered)
year count
0 1762 1
1 1860 1
2 1861 1
3 1864 2
4 1865 1
5 1925 1
6 1928 1
7 1929 2
8 1930 5
9 1931 2
Year and Counts of UFO Sightings (no hoax filter)
year count
0 1970 206
1 1975 404
2 1980 324
3 1985 274
4 1990 339
5 1995 1510
6 2000 3134
7 2005 4525
8 2010 4833
9 2011 5667
Recent Year and Counts of UFO Sightings (hoax filtered)
year count
43 2013 7812
44 2014 8647
45 2015 6867
46 2016 5621
47 2017 5061
48 2018 3471
49 2019 6374
50 2020 7360
51 2021 3486
52 2022 4305
Recent Year and Counts of UFO Sightings (no hoax filter)
year count
42 2013 7860
43 2014 8765
44 2015 6987
45 2016 5758
46 2017 5145
47 2018 3517
48 2019 6370
49 2020 7349
50 2021 3477
51 2022 4312
# plot distribution line graph of all years by count
# make a line graph of ALL years by count
all_yrs.plot(x="year", y="count", kind="line", legend=True)
plt.title("(Comprehensive) Year by Counts of UFO Sightings")
plt.legend()
plt.ylim(0, 9000)
plt.tight_layout()
# save as png file
plt.savefig("plots/all_years_line.png")
plt.show()
# plot trend line and line graph of all years by count
# make a line graph of ALL years by count
all_yrs.plot(x="year", y="count", kind="line", legend=True)
# get x and y values to plot trend line
x = all_yrs["year"]
y = all_yrs["count"]
z = np.polyfit(x,y,1) # set degree as one for linear graph
p = np.poly1d(z) # create a polynomial object
# plot the trend line
plt.plot(x, p(x), color="red", label="trend line")
plt.title("(Comprehensive) Year by Counts of UFO Sightings")
plt.legend()
plt.ylim(-1000, 9000)
plt.tight_layout()
# save as png file
plt.savefig("plots/all_years_trend_line.png")
plt.show()
# plot trend line and line graph of recent years by count
# make a line graph of year by count in recent_yr_list
recent_yrs_comp.plot(x="year", y="count", kind="line", legend=True)
# get x and y values to plot trend line
x = recent_yrs_comp["year"]
y = recent_yrs_comp["count"]
z = np.polyfit(x,y,1) # set degree as one for linear graph
p = np.poly1d(z) # create a polynomial object
# plot the trend line
plt.plot(x, p(x), color="red", label="trend line")
plt.title("Recent Year by Counts of UFO Sightings")
plt.legend()
plt.ylim(-1000, 9000)
plt.tight_layout()
# save as png file
plt.savefig("plots/recent_years_trend_line.png")
plt.show()
# make a bar graph of year by count
yrs_comp.plot(x="year", y="count", kind="bar")
plt.title("Bar Graph of Year by Count of UFO Sightings")
plt.tight_layout()
plt.savefig("plots/years_bar.png")
plt.show()
# testing
#len(recent_yr_list)
# verify year count with selection of the dataframe and shape function
compare_df = ufo_df.copy()
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
compare_df.loc[compare_df["year"] == 1970].shape[0]
# verify year count with selection of the dataframe and shape function
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
compare_df.loc[compare_df["year"] == 2013].shape[0]
7812
# make bar graph for years > 1930 and counts
fig, ax = plt.subplots(figsize=(9,5))
# get x and y values from yrs_comp dataframe
x = yrs_comp["year"]
y = yrs_comp["count"]
# make a bar plot with the x and y values
bar_container = ax.bar(x, y)
# set the labels and title
ax.set(xlabel="year", ylabel='count', title="Years > 1930 by Counts of UFO Sightings")
# format the bar_label object
ax.bar_label(bar_container, fontsize=8, rotation=90, padding=10)
#plt.xticks(yr_list[:7],yr_list[:7])
plt.xlim(1927,2023) # set the x-axis limit
plt.ylim(0,10000) # set the y-axis limit
plt.tight_layout()
plt.show()
# make bar graph for recent years (1970 and on) and counts
fig, ax = plt.subplots(figsize=(9,5))
# get x and y values from yrs_comp dataframe
x = recent_yrs_comp["year"]
y = recent_yrs_comp["count"]
# make a bar plot with the x and y values
bar_container = ax.bar(x, y)
# set the labels and title
ax.set(xlabel="year", ylabel='count', title="Recent Year by Counts of UFO Sightings", ylim=(0, 9500))
# format the bar_label object
ax.bar_label(bar_container, fontsize=10, rotation=90, padding=15)
plt.xlim(1969, 2023) # set the x-axis limit
plt.ylim(0, 10500) # set the y-axis limit
plt.tight_layout()
# save as png file
plt.savefig("plots/recent_years_barplot.png")
plt.show()
# copy of ufo_df to compare_df
compare_df = ufo_df.copy()
# convert the date_time column to datetime format
compare_df["date_time"] = pd.to_datetime(compare_df["date_time"])
# extract the year from the date_time column and store it in a new column
compare_df["year"] = compare_df["date_time"].dt.year
# group by year and count the number of sightings
top_years = compare_df.groupby("year").size()
# sort the series in descending order and select the top years
top_years = top_years.sort_values(ascending=False).head(15)
# reset the index of top_years to count
top_years = top_years.reset_index(name="count")
#top_years = top_years.astype(int)
# print the years with count table
print("Years with Highest Counts of UFO Sightings\n", top_years.to_string(index=False))
# make bar graph with subplot size formatting
fig, ax = plt.subplots(figsize=(7,4))
# assign years to x for bar values
x = top_years["year"]
# assign count to y
y = top_years["count"]
# make a bar chart with the x and y values
# subtract 0.4 from x values
bar_container = ax.bar(x, y)
# set the labels and title
ax.set(xlabel="year", ylabel="count", title="Years with Highest Counts of UFO Sightings")
# format the bar_label object
ax.bar_label(bar_container, fontsize=8, rotation=90, padding=8)
# set the xticks and xticklabels
plt.xticks(x, top_years["year"], rotation=90,)
plt.ylim(0, 10000) # set the y-axis limit
# save as png file
plt.savefig("plots/top_years_barplot.png")
plt.show()
Years with Highest Counts of UFO Sightings year count 2014 8647 2012 8133 2013 7812 2020 7360 2015 6867 2019 6374 2016 5621 2011 5585 2008 5187 2017 5061 2009 4943 2010 4753 2004 4753 2007 4652 2005 4504
# get the month from the date_time column
compare_df = ufo_df.copy()
compare_df["month"] = pd.DatetimeIndex(compare_df["date_time"]).month
# group by month and count sightings
sightings = compare_df.groupby("month").size()
# sort by count in descending order
sightings = sightings.sort_values(ascending=False)
# select the first 4 rows
sightings = sightings.head(4)
# reset index to get a dataframe
sightings = sightings.reset_index(name="count")
print("Top 4 Months with Highest Counts of UFO Sightings\n", sightings.astype(int).to_string(index=False))
Top 4 Months with Highest Counts of UFO Sightings
month count
7 16302
8 14607
6 13710
9 13417
# plot a bar chart for the counts by month of UFO sightings
compare_df = ufo_df.copy()
# get the month from the date_time column
compare_df["month"] = pd.DatetimeIndex(compare_df["date_time"]).month
# make a bar graph of count of sightings by month
month_count = compare_df.groupby("month").size()
# reset the index to count to get a dataframe
month_count = month_count.reset_index(name="count")
# make a month list for xticks markers
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# make bar graph with subplot size formatting
fig, ax = plt.subplots(figsize=(6,5))
# get x and y values from yrs_comp dataframe
x = month_count["month"]
y = month_count["count"]
# make a bar chart with the x and y values
bar_container = ax.bar(x, y)
# set the labels and title
ax.set(xlabel="month", ylabel="count", title="Counts by Month of UFO Sightings")
# format the bar_label object
ax.bar_label(bar_container, fontsize=9, rotation=90, padding=10)
# set the xticks and xticklabels
plt.xticks(range(1, 13), months)
plt.ylim(0, 19000) # set the y-axis limit
plt.tight_layout()
# Save the figure as a PNG file named 'line_plot.png' in your current directory
plt.savefig("plots/month_barplot.png")
plt.show()
# testing
# Verify month count with selection of the dataframe and shape function
compare_df["month"] = pd.DatetimeIndex(compare_df["date_time"]).month
compare_df.loc[compare_df["month"] == 7].shape[0]
# Verify month count with selection of the dataframe and shape function
#compare_df["month"] = pd.DatetimeIndex(compare_df["date_time"]).month
#compare_df.loc[compare_df["month"] == 9].shape[0]
16302
compare_df = ufo_df.copy() # Need to make copy of ufo_df because this block may alter the actual dataframe
# filter the dataframe by country not equal to USA
compare_df = compare_df.loc[compare_df["country"] != "USA"]
# group by city, state, and location; count sightings for each group
sightings = compare_df.groupby("city").size()
sightings2 = compare_df.groupby("state").size()
sightings3 = compare_df.groupby("country").size()
# sort by counts in descending order
sightings = sightings.sort_values(ascending=False)
sightings2 = sightings2.sort_values(ascending=False)
sightings3 = sightings3.sort_values(ascending=False)
# select top 10 values
sightings = sightings.head(n=10)
sightings2 = sightings2.head(n=10)
sightings3 = sightings3.head(n=10)
# reset index to get a dataframe
sightings = sightings.reset_index(name="count")
sightings2 = sightings2.reset_index(name="count")
sightings3 = sightings3.reset_index(name="count")
# add 1 to the new index to make list start at 1
sightings.index = sightings.index + 1
sightings2.index = sightings2.index + 1
sightings3.index = sightings3.index + 1
# print the result
print("Top 10 (Non-USA) Cities with Highest Numbers of Sightings\n", sightings)
print("\nTop 10 (Non-USA) States with Highest Numbers of Sightings\n", sightings2)
print("\nTop 10 (Non-USA) Countries with Highest Numbers of Sightings\n",sightings3)
Top 10 (Non-USA) Cities with Highest Numbers of Sightings
city count
1 London 326
2 Toronto 265
3 Calgary 146
4 Vancouver 136
5 Ottawa 132
6 Edmonton 129
7 Winnipeg 106
8 Victoria 104
9 Montreal 104
10 Manchester 83
Top 10 (Non-USA) States with Highest Numbers of Sightings
state count
1 nan 5784
2 ON 2391
3 BC 1145
4 AB 657
5 QC 428
6 MB 238
7 SK 224
8 NS 213
9 Ireland 212
10 NB 195
Top 10 (Non-USA) Countries with Highest Numbers of Sightings
country count
1 Canada 5649
2 Australia 867
3 India 439
4 Mexico 387
5 United Kingdom 236
6 New Zealand 195
7 South Africa 193
8 Germany 193
9 Puerto Rico 181
10 Unknown 170
compare_df = ufo_df.copy() # Need to make copy of ufo_df because this block may alter the actual dataframe
# group by country
top_countries = compare_df.groupby("country").size()
# sort in descending order
top_countries = top_countries.sort_values(ascending=False)
# reset the index to count to get a dataframe
top_countries = top_countries.reset_index(name="count")
# add 1 to the index so list starts at 1
top_countries.index = top_countries.index + 1
print("Top Countries with Highest Counts of UFO Sightings\n", top_countries.head(10))
Top Countries with Highest Counts of UFO Sightings
country count
1 USA 124633
2 Canada 5649
3 Australia 867
4 India 439
5 Mexico 387
6 United Kingdom 236
7 New Zealand 195
8 Germany 193
9 South Africa 193
10 Puerto Rico 181
compare_df = ufo_df.copy() # make copy of ufo_df
# create a list of countries
countries = ["Japan", "Russia", "USA", "Australia", "New Zealand", "Brazil", "Canada", "China", "France", "Germany", "India", "Italy", "Mexico", "Philippines", "South Africa", "United Kingdom"]
# sort the countries list
countries.sort()
# filter by countries list
top_countries = compare_df.loc[compare_df["country"].isin(countries)]
# group by country and get the size
top_countries = top_countries.groupby("country").size()
# reset index to count
top_countries = top_countries.reset_index(name="count")
# plot the graph and set to appropriate format and labels
top_countries.plot(x="country", xlabel="country", y="count", ylabel="count", rot=90, kind="line", legend=True)
plt.title("Graph of Select Countries by Count of UFO Sightings")
plt.legend()
plt.xticks(range(0, 16), countries) # set the xticks and xticklabels
plt.xlim(-1,16)
plt.savefig("plots/select_countries_line.png")
plt.show()
# Bar graph of top countries by highest count
compare_df = ufo_df.copy() # Need to make copy of ufo_df because this block may alter the actual dataframe
# group by country
top_countries = compare_df.groupby("country").size()
# sort in descending order
top_countries = top_countries.sort_values(ascending=False)
# get the top countries values
top_countries = top_countries.head(10)
# reset the index to count to get a dataframe
top_countries = top_countries.reset_index(name="count")
# add 1 to the index so list starts at 1
top_countries.index = top_countries.index + 1
print("Countries with Highest Counts of UFO Sightings\n", top_countries)
# make fig and ax objects with subplot size formatting
fig, ax = plt.subplots(figsize=(8,6))
# get x and y values from yrs_comp dataframe
x = top_countries["country"]
y = top_countries["count"]
# make a bar chart with the x and y values
bar_container = ax.bar(x, y)
# set the labels and title
ax.set(xlabel="country", ylabel="count", title="Countries by Highest Count of UFO Sightings")
# format the bar_label object
ax.bar_label(bar_container, fontsize=9, rotation=90, padding=15)
# set the xticks and xticklabels
plt.xticks(range(0, 10), top_countries["country"], rotation=90)
# set the y-axis limit
plt.ylim(0, 150000)
plt.tight_layout()
plt.savefig("plots/top_countries_bar.png")
plt.show()
Countries with Highest Counts of UFO Sightings
country count
1 USA 124633
2 Canada 5649
3 Australia 867
4 India 439
5 Mexico 387
6 United Kingdom 236
7 New Zealand 195
8 Germany 193
9 South Africa 193
10 Puerto Rico 181
# copy ufo_df because this block may alter the actual dataframe
compare_df = ufo_reset_df.copy()
# group by location and count sightings
sightings_city = compare_df.groupby(["city","freq"]).size()
sightings_st = compare_df.groupby(["state","freq"]).size()
sightings_ctry = compare_df.groupby(["country","freq"]).size()
# sort by count in descending order
sightings_city = sightings_city.sort_values(ascending=False)
sightings_st = sightings_st.sort_values(ascending=False)
sightings_ctry = sightings_ctry.sort_values(ascending=False)
# select the first 10 rows
sightings_city = sightings_city.head(10)
sightings_st = sightings_st.head(10)
sightings_ctry = sightings_ctry.head(10)
# reset index to get a dataframe
sightings_city = sightings_city.reset_index(name='count')
sightings_st = sightings_st.reset_index(name='count')
sightings_ctry = sightings_ctry.reset_index(name='count')
sightings_ctry_no_fil.index = sightings_ctry_no_fil.index + 1
sightings_ctry.index = sightings_ctry.index + 1
# print the results of sightings frequencies with locations
print("Top Cities of UFO Sightings with hoax filter\n", sightings_city)
print("\nTop States of UFO Sightings with hoax filter\n", sightings_st)
print("\nTop Countries of UFO Sightings with hoax filter\n", sightings_ctry)
Top Cities of UFO Sightings with hoax filter
city freq count
0 Portland 0.025058 105
1 Tinley Park 0.066584 93
2 Phoenix 0.032218 90
3 Phoenix 0.031502 88
4 New York 0.029354 82
5 Tucson 0.017183 72
6 New York 0.024342 68
7 Charlotte 0.009307 65
8 Austin 0.009307 65
9 Seattle 0.021479 60
Top States of UFO Sightings with hoax filter
state freq count
0 CA 0.000716 5241
1 nan 0.000716 4797
2 CA 0.001432 2814
3 NY 0.000716 2584
4 PA 0.000716 2437
5 FL 0.000716 2306
6 TX 0.000716 2271
7 WA 0.000716 1944
8 MI 0.000716 1664
9 IL 0.000716 1648
Top Countries of UFO Sightings with hoax filter
country freq count
1 USA 0.000716 48056
2 USA 0.001432 20804
3 USA 0.002148 11736
4 USA 0.002864 7648
5 USA 0.003580 5640
6 USA 0.004296 4152
7 USA 0.005012 3542
8 Canada 0.000716 2893
9 USA 0.005728 2752
10 USA 0.006444 2160
# copy ufo_df because this block may alter the actual dataframe
compare_df = ufo_df.copy()
# group by location and count sightings
sightings2 = compare_df.groupby(["city"]).size()
sightings3 = compare_df.groupby(["state"]).size()
sightings4 = compare_df.groupby(["country"]).size()
# sort by count in descending order
sightings2 = sightings2.sort_values(ascending=False)
sightings3 = sightings3.sort_values(ascending=False)
sightings4 = sightings4.sort_values(ascending=False)
# select the first 10 rows
sightings2 = sightings2.head(10)
sightings3 = sightings3.head(10)
sightings4 = sightings4.head(10)
# reset index to get a dataframe
sightings2 = sightings2.reset_index(name='count')
sightings3 = sightings3.reset_index(name='count')
sightings4 = sightings4.reset_index(name='count')
sightings_ctry_no_fil.index = sightings_ctry_no_fil.index + 1
sightings4.index = sightings4.index + 1
# print the results of sightings with and without hoax filter
print_list("Top Cities of UFO Sightings without hoax filter", "city", "count", sightings_no_fil) # from previous block above, before hoax filter application to dataframe
print_list("\nTop Cities of UFO Sightings with hoax filter", "city", "count", sightings2)
print_sm_list("\nTop States of UFO Sightings without hoax filter", "state", "count", sightings_st_no_fil)
print_sm_list("\nTop States of UFO Sightings with hoax filter", "state", "count", sightings3)
print("\nTop Countries of UFO Sightings without hoax filter\n", sightings_ctry_no_fil)
print("\nTop Countries of UFO Sightings with hoax filter\n", sightings4)
Top Cities of UFO Sightings without hoax filter
city count
1. New York 873
2. Phoenix 793
3. Seattle 783
4. Las Vegas 683
5. Portland 659
6. Los Angeles 578
7. San Diego 567
8. Houston 476
9. Chicago 453
10. Tucson 451
Top Cities of UFO Sightings with hoax filter
city count
1. New York 856
2. Phoenix 787
3. Seattle 780
4. Las Vegas 678
5. Portland 652
6. Los Angeles 576
7. San Diego 564
8. Houston 473
9. Chicago 450
10. Tucson 448
Top States of UFO Sightings without hoax filter
state count
1. CA 15454
2. FL 7823
3. WA 6895
4. nan 6267
5. TX 5816
6. NY 5624
7. PA 4780
8. AZ 4727
9. OH 4271
10. IL 4111
Top States of UFO Sightings with hoax filter
state count
1. CA 15331
2. FL 7756
3. WA 6850
4. nan 6178
5. TX 5758
6. NY 5559
7. PA 4727
8. AZ 4691
9. OH 4219
10. IL 4060
Top Countries of UFO Sightings without hoax filter
country count
3 USA 125880
4 Canada 5706
5 Australia 886
6 India 445
7 Mexico 389
8 United Kingdom 238
9 Germany 200
10 New Zealand 197
11 South Africa 195
12 Puerto Rico 182
Top Countries of UFO Sightings with hoax filter
country count
1 USA 124633
2 Canada 5649
3 Australia 867
4 India 439
5 Mexico 387
6 United Kingdom 236
7 New Zealand 195
8 Germany 193
9 South Africa 193
10 Puerto Rico 181
# Need to make copy of ufo_df because this block may alter the actual dataframe
compare_df = ufo_df.copy()
# filter the dataframe by country equal to USA
compare_df = compare_df.loc[compare_df["country"] == "USA"]
# group by location and count sightings
sightings2 = compare_df.groupby(["city"]).size()
sightings3 = compare_df.groupby(["state"]).size()
# sort by count in descending order
sightings2 = sightings2.sort_values(ascending=False)
sightings3 = sightings3.sort_values(ascending=False)
# select the first 10 rows
sightings2 = sightings2.head(10)
sightings3 = sightings3.head(10)
# reset index to get a dataframe
sightings2 = sightings2.reset_index(name='count')
sightings3 = sightings3.reset_index(name='count')
# print the results of sightings with and without hoax filter
print_sm_list("Top US Cities of UFO Sightings without hoax filter", "city", "count", sightings_us_no_fil)
print_sm_list("\nTop US Cities of UFO Sightings with hoax filter", "city", "count", sightings2)
print_sm_list("\nTop US States with UFO Sightings without hoax filter", "state", "count", sightings_us_st_no_fil)
print_sm_list("\nTop US States of UFO Sightings with hoax filter", "state", "count", sightings3)
Top US Cities of UFO Sightings without hoax filter city count 1. New York 872 2. Phoenix 793 3. Seattle 783 4. Las Vegas 682 5. Portland 659 6. Los Angeles 578 7. San Diego 565 8. Chicago 452 9. Tucson 451 10. Houston 447 Top US Cities of UFO Sightings with hoax filter city count 1. New York 856 2. Phoenix 787 3. Seattle 780 4. Las Vegas 677 5. Portland 652 6. Los Angeles 576 7. San Diego 562 8. Chicago 449 9. Tucson 448 10. Houston 444 Top US States with UFO Sightings without hoax filter state count 1. CA 15446 2. FL 7815 3. WA 6893 4. TX 5813 5. NY 5619 6. PA 4775 7. AZ 4726 8. OH 4265 9. IL 4108 10. MI 3489 Top US States of UFO Sightings with hoax filter state count 1. CA 15323 2. FL 7749 3. WA 6848 4. TX 5755 5. NY 5555 6. PA 4724 7. AZ 4690 8. OH 4214 9. IL 4057 10. NC 3452
# create a bar chart for 10 cities with highest counts of UFO sightings
# make copy of ufo_df
compare_df = ufo_df.copy()
# group by location and count sightings
sightings_city = compare_df.groupby(["city"]).size()
# sort by count in descending order
sightings_city = sightings_city.sort_values(ascending=False)
# select the first 10 rows
sightings_city = sightings_city.head(10)
# reset index to get a dataframe
sightings_city = sightings_city.reset_index(name='count')
# make fig and ax objects with subplot size formatting
fig, ax = plt.subplots(figsize=(7,5))
# get x and y values
x = sightings_city["city"]
y = sightings_city["count"]
# make a bar chart with the x and y values
bar_container = ax.bar(x, y)
# set the labels and title
ax.set(xlabel="city", ylabel="count", title="Cities by Highest Count of UFO Sightings")
# format the bar_label object
ax.bar_label(bar_container, fontsize=10, rotation=90, padding=9)
# set the xticks and xticklabels
plt.xticks(range(0, 10), sightings_city["city"], rotation=90)
# set the y-axis limit
plt.ylim(0, 1000)
plt.tight_layout()
# save as png file
plt.savefig("plots/cities_counts_bar.png")
plt.show()
# make copy of ufo_df
compare_df = ufo_df.copy()
# filter the dataframe by country equal to USA
compare_df = compare_df.loc[compare_df["country"] == "USA"]
# group by location and count sightings
sightings_st = compare_df.groupby(["state"]).size()
# sort by count in descending order
sightings_st = sightings_st.sort_values(ascending=False)
# select the first 10 rows
sightings_st = sightings_st.head(10)
# reset index to get a dataframe
sightings_st = sightings_st.reset_index(name='count')
# make fig and ax objects with subplot size formatting
fig, ax = plt.subplots(figsize=(7,5))
# get x and y values
x = sightings_st["state"]
y = sightings_st["count"]
# make a bar chart with the x and y values
bar_container = ax.bar(x, y)
# set the labels and title
ax.set(xlabel="state", ylabel="count", title="States by Highest Count of UFO Sightings")
# format the bar_label object
ax.bar_label(bar_container, fontsize=10, rotation=90, padding=9)
# set the xticks and xticklabels
plt.xticks(range(0, 10), sightings_st["state"], rotation=90)
# set the y-axis limit
plt.ylim(0, 18500)
plt.tight_layout()
# save as png file
plt.savefig("plots/states_counts_bar.png")
plt.show()
#import plotly.express as px
# make copy of ufo_df
compare_df = ufo_reset_df.copy()
# subset the ufo_cities with columns needed
compare_df = compare_df[["year","city","state","country"]]
# filter the dataframe by country equal to USA
compare_df = compare_df.loc[compare_df["country"] == "USA"]
# calculate the count of the state column and assign it to state_ct
compare_df["state_count"] = compare_df.groupby(["state"])["country"].transform("count")
# reduce observations by limiting the years == 2022
compare_df = compare_df.loc[compare_df["year"] > 2020]
# drop missing values
compare_df = compare_df.dropna()
# make heatmap using plotly
fig = px.choropleth(compare_df, locations='state', locationmode='USA-states',
color='state_count', color_continuous_scale='Reds',
scope='usa', title='US State Count/Frequency of UAP Sightings')
fig.show()
fig.write_html("maps/state_heatmap.html")
compare_df = ufo_df.copy()
# group by location and count sightings
sightings2 = compare_df.groupby("country").size()
sightings3 = compare_df.groupby("country").size()
# sort by count in descending order
sightings2 = sightings2.sort_values(ascending=False)
# sort by count in ascending order
sightings3 = sightings3.sort_values(ascending=True)
# select the first 10 rows
sightings2 = sightings2.head(10)
sightings3 = sightings3.head(10)
# reset index to get a dataframe and add 1 to index so the list will display starting at 1
sightings2 = sightings2.reset_index(name='count')
sightings2.index = sightings2.index + 1
sightings3 = sightings3.reset_index(name='count')
sightings3.index = sightings3.index + 1
# print the results of sightings
print("Countries with Highest Number of UFO Sightings\n", sightings2.to_string(index=False))
print("\nCountries/Territories with Lowest Number of UFO Sightings\n", sightings3.to_string(index=False))
Countries with Highest Number of UFO Sightings
country count
USA 124633
Canada 5649
Australia 867
India 439
Mexico 387
United Kingdom 236
New Zealand 195
Germany 193
South Africa 193
Puerto Rico 181
Countries/Territories with Lowest Number of UFO Sightings
country count
Above The Pacific Ocean 1
Sinai 1
International Space Station 1
Israel - Near Petach Tikva 1
Israel\Occupied Palestine 1
Santo Domingo 1
Italy/Greece 1
San Miguel Tecuitlapa, Puebla, Mexico 1
San Juan And St. Thomas Usvi 1
Saint Maarten 1
compare_df = ufo_df.copy()
# extract the year and month from the date_time column
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
compare_df["month"] = pd.DatetimeIndex(compare_df["date_time"]).month
# get different feature groupings and display the frequencies for each group
freq_ctry = compare_df.groupby(["year", "country"]).size()
# reset index to get a dataframe with a column for frequency
freq_ctry = freq_ctry.reset_index(name="frequency")
freq_ctry["frequency"] = freq_ctry["frequency"].astype(int)
# filter by the year condition (recent years 1970 and on)
freq_ctry = freq_ctry.query("year > 1781")
# plot a line graph of frequency by year for each country
freq_ctry.plot(x="year", y="frequency", ylabel="count", legend=False, kind="line", title="Frequency of UAP Sightings in All Countries by Year")
plt.savefig("plots/comp_yr_freq_line.png")
plt.show()
# define the list of countries you want to show
countries = ["USA", "Canada", "United Kingdom", "Australia", "China", "Russia", "Japan"]
# filter the DataFrame by the countries list
freq_ctry = freq_ctry[freq_ctry["country"].isin(countries)]
# filter the DataFrame by the countries list and the year condition
#frequency1d_filtered = frequency1d.query("country in @countries and year > 1970")
# plot a line graph of frequency by year for each country
freq_ctry.plot(x="year", y="frequency", ylabel="count", kind="line", legend=False, title="Frequency of UAP Sightings for Select Countries")
plt.ylim(0, 8500)
plt.tight_layout()
# save as png file
plt.savefig("plots/recent_yr_freq_line.png")
plt.show()
# filter the dataFrame by the country and the year condition
freq_ctry = freq_ctry.query("country == \"USA\" and year > 1970")
# plot a line graph of frequency by year for each country
freq_ctry.plot(x="year", y="frequency", ylabel="count", kind="line", legend=False, title="Frequency of UAP Sightings in US")
# save as png file
plt.savefig("plots/us_freq_line.png")
plt.show()
compare_df = ufo_df.copy()
# extract the year and month from the date_time column
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
# define the list of countries you want to show
countries = ["USA", "Canada", "United Kingdom", "Australia", "China", "Russia", "Japan"]
# filter the DataFrame by the countries list
compare_df = compare_df[compare_df["country"].isin(countries)]
# get different feature groupings and display the frequencies for each group
freq_ctry = compare_df.groupby(["year", "country"]).size()
# reset index to get a dataframe with a column for frequency
freq_ctry = freq_ctry.reset_index(name="frequency")
freq_ctry["frequency"] = freq_ctry["frequency"].astype(int)
# filter by the year condition (recent years 1970 and on)
freq_ctry = freq_ctry.query("country == \"United Kingdom\" and year > 1970")
# plot a line graph of frequency by year for each country
freq_ctry.plot(x="year", y="frequency", ylabel="count", legend=False, kind="line", title="Frequency of UAP Sightings in the UK")
plt.show()
compare_df = ufo_df.copy()
# extract the year and month from the date_time column
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
# define the list of countries you want to show
countries = ["USA", "Canada", "United Kingdom", "Australia", "China", "Russia", "Japan"]
# filter the DataFrame by the countries list
compare_df = compare_df[compare_df["country"].isin(countries)]
# get different feature groupings and display the frequencies for each group
freq_ctry = compare_df.groupby(["year", "country"]).size()
# reset index to get a dataframe with a column for frequency
freq_ctry = freq_ctry.reset_index(name="frequency")
freq_ctry["frequency"] = freq_ctry["frequency"].astype(int)
# filter the DataFrame by the country and the year condition
freq_ctry = freq_ctry.query("country == \"Japan\" and year > 1970")
# plot a line graph of frequency by year for each country
freq_ctry.plot(x="year", y="frequency", ylabel="count", legend=False, kind="line", title="Frequency of UAP Sightings in Japan")
plt.show()
compare_df = ufo_df.copy()
# extract the year and month from the date_time column
compare_df["year"] = pd.DatetimeIndex(compare_df["date_time"]).year
# define the list of countries you want to show
countries = ["USA", "Canada", "United Kingdom", "Australia", "China", "Russia", "Japan"]
# filter the DataFrame by the countries list
compare_df = compare_df[compare_df["country"].isin(countries)]
# get different feature groupings and display the frequencies for each group
freq_ctry = compare_df.groupby(["year", "country"]).size()
# reset index to get a dataframe with a column for frequency
freq_ctry = freq_ctry.reset_index(name="frequency")
freq_ctry["frequency"] = freq_ctry["frequency"].astype(int)
# filter by the year condition (recent years 1970 and on)
freq_ctry = freq_ctry.query("country == \"Canada\" and year > 1970")
# plot a line graph of frequency by year for each country
freq_ctry.plot(x="year", y="frequency", ylabel="count", legend=False, kind="line", title="Frequency of UAP Sightings in Canada")
plt.show()
compare_df = ufo_df.copy() # Need to make copy of ufo_df because this block may alter the actual dataframe
# group by shape
top_shapes = compare_df.groupby("shape").size()
# sort in descending order
top_shapes = top_shapes.sort_values(ascending=False)
# get the top shape values
top_shapes = top_shapes.head(10)
# reset the index to count to get a dataframe
top_shapes = top_shapes.reset_index(name="count")
# add 1 to the index so list starts at 1
top_shapes.index = top_shapes.index + 1
# print table heading
print("Top UFO Shapes in Sightings\n", top_shapes)
# make a bar chart for the counts of UFO shapes in sightings
top_shapes = compare_df.groupby("shape").size()
top_shapes = top_shapes.sort_values(ascending=False)
top_shapes.plot(x="shape", y="count", ylabel="count", kind="bar", subplots=True, label="Counts by Shape of UFO Sightings")
# Save the figure as a PNG file named 'line_plot.png' in your current directory
plt.savefig("plots/topshapes_bar.png")
plt.show()
Top UFO Shapes in Sightings
shape count
1 light 28336
2 circle 14925
3 triangle 12458
4 fireball 9591
5 other 9552
6 unknown 9433
7 sphere 9107
8 disk 8198
9 oval 6051
10 formation 4641
compare_df = ufo_df.copy() # copy the ufo_df because this block may alter the actual dataframe
# create a list of countries
countries = ["Japan", "Russia", "USA", "Australia", "Brazil", "Canada", "China", "France", "Germany", "India", "Italy", "Mexico", "New Zealand", "Philippines", "South Africa", "United Kingdom"]
countries.sort()
# filter by country
compare_df = compare_df.loc[compare_df['country'].isin(countries)]
# group by country and find the shape with the highest count for each country
top_shapes = compare_df.groupby("country")["shape"].value_counts().groupby("country").idxmax()
# convert series to dataframe
top_shapes = top_shapes.to_frame()
# rename the column
top_shapes.columns = ["shape"]
# reset the index
top_shapes = top_shapes.reset_index()
# set country as the index of the dataframe
top_shapes = top_shapes.set_index("country")
# reindex the dataframe with the list of countries
top_shapes = top_shapes.reindex(countries)
# apply a function to extract the second element of each tuple in the shape column
top_shapes["shape"] = top_shapes["shape"].apply(lambda x: x[1])
# reset the index again
top_shapes = top_shapes.reset_index()
top_shapes.index = top_shapes.index + 1
# print the result with a tab separator
print("Most Common Shapes for Select Countries")
print(top_shapes.to_string(index=False, header=True))
# save table to file
top_shapes.to_csv("data/topshape_countries.csv")
Most Common Shapes for Select Countries
country shape
Australia light
Brazil light
Canada light
China circle
France triangle
Germany light
India circle
Italy disk
Japan light
Mexico light
New Zealand light
Philippines circle
Russia light
South Africa light
USA light
United Kingdom light
compare_df = ufo_df.copy()
# create a list of countries
countries = ["Japan", "Russia", "USA", "Australia", "New Zealand", "Brazil", "Canada", "China", "France", "Germany", "India", "Italy", "Mexico", "Philippines", "South Africa", "United Kingdom"]
countries.sort()
# filter by country
compare_df = compare_df.loc[compare_df["country"].isin(countries)]
# group by country and find the top 4 most common shapes for each country
top_shapes = compare_df.groupby("country")["shape"].value_counts().groupby("country").nlargest(3)
# rename the index levels
top_shapes.index.names = ["country","", "shape2"]
# reset the index
top_shapes = top_shapes.reset_index()
# rename the last column
top_shapes.columns = ["country", "", "shape", "count"]
# set the country and shape as the index of the series
top_shapes = top_shapes.set_index(["country", "shape"])["count"]
# print the result with a tab separator
print("3 Most Common Shapes for Select Countries")
print(top_shapes, sep="\t")
3 Most Common Shapes for Select Countries
country shape
Australia light 178
circle 95
disk 87
Brazil light 29
circle 17
triangle 14
Canada light 1243
circle 573
triangle 474
China circle 10
light 7
triangle 6
France triangle 16
light 15
other 13
Germany light 35
circle 18
oval 18
India circle 58
light 54
triangle 46
Italy disk 9
circle 8
fireball 7
Japan light 22
disk 8
other 7
Mexico light 83
disk 43
sphere 43
New Zealand light 40
circle 27
disk 16
Philippines circle 15
light 14
disk 13
Russia light 7
disk 6
other 4
South Africa light 48
disk 21
circle 20
USA light 25343
circle 13246
triangle 11247
United Kingdom light 50
circle 38
triangle 18
Name: count, dtype: int64
# copy ufo_reset_df to compare_df
compare_df = ufo_reset_df.copy()
# convert the duration column to a timedelta type
compare_df["duration"] = pd.to_timedelta(compare_df["duration"])
# get the 4 longest values in the duration column
longest = compare_df["duration"].nlargest(5)
# reset index to get a dataframe
longest = longest.reset_index(name="time")
longest.index = longest.index + 1
print("Longest Duration of UFO Sightings")
print(longest["time"])
Longest Duration of UFO Sightings 1 100 days 00:00:00 2 70 days 20:00:00 3 12 days 12:04:00 4 4 days 00:00:00 5 3 days 20:00:00 Name: time, dtype: timedelta64[ns]
compare_df = normal_df_dur.copy() # copy duration_df in case we alter it
# get the mean value of the duration column
duration_mean = compare_df["duration"].mean()
# get only the numeric value in seconds
duration_mean_seconds = duration_mean.total_seconds()
# get the mean in minutes by dividing by 60
duration_mean_min = duration_mean_seconds / 60
# alternative: use the components attribute to get the minutes component
#duration_mean_min = duration_mean.components.minutes
print("Average duration of UFO sighting in minutes: ", round(duration_mean_min))
print("Average duration of UFO sighting in seconds: ", round(duration_mean_seconds))
Average duration of UFO sighting in minutes: 23 Average duration of UFO sighting in seconds: 1401
print("Descriptive stats for city latitude and longitude of UFO sightings\n", stats_lat_long)
Descriptive stats for city latitude and longitude of UFO sightings
city_latitude city_longitude
count 115440.000000 115440.000000
mean 38.697147 -95.102130
std 5.736872 18.123691
min -31.941922 -176.627400
25% 34.223597 -112.975200
50% 39.273300 -89.081250
75% 42.352576 -80.332976
max 70.639300 130.879729
print("Range of city latitude where UFO sightings occur: ", city_lat_max, "to", city_lat_min)
print("Range of city longitude where UFO sightings occur: ", city_long_max, "to", city_long_min)
Range of city latitude where UFO sightings occur: 71 to -32 Range of city longitude where UFO sightings occur: 131 to -177
# copy ufo_reset_df to ufo_cities
ufo_cities = ufo_reset_df.copy()
# subset the ufo_cities with columns needed
ufo_cities = ufo_cities[["city","state","country","year","freq"]]
# get the top prob values
high_prob =ufo_cities.groupby(["year","city","country","freq"]).size()
# sort in descending order
high_prob = high_prob.sort_values(ascending=False)
# get the top shape values
high_prob = high_prob.head(15)
# reset the index to count to get a dataframe
high_prob = high_prob.reset_index(name="count")
# add 1 to the index so list starts at 1
high_prob.index = high_prob.index + 1
print("Top Cities with Highest Frequencies of UFO Sightings based on year, city, and country")
print(high_prob)
Top Cities with Highest Frequencies of UFO Sightings based on year, city, and country
year city country freq count
1 2004 Tinley Park USA 0.066584 93
2 2014 Portland USA 0.037946 53
3 1997 Phoenix USA 0.037230 52
4 2014 New York USA 0.037230 52
5 2012 Seattle USA 0.034366 48
6 2003 New York USA 0.034366 48
7 2001 New York USA 0.033650 47
8 2014 Myrtle Beach USA 0.033650 47
9 2015 Las Vegas USA 0.032934 46
10 2014 Las Vegas USA 0.032218 45
11 2005 Phoenix USA 0.032218 45
12 2014 Phoenix USA 0.032218 45
13 2015 Phoenix USA 0.031502 44
14 2016 Phoenix USA 0.031502 44
15 2012 Las Vegas USA 0.031502 44
from matplotlib.container import BarContainer
# make a bar chart for highest frequencies and probabilities in a year, city and country
ufo_cities = ufo_reset_df.copy()
# subset the ufo_cities with columns needed
ufo_cities = ufo_cities[["city","state","country","year","freq"]]
# get the top prob values
high_prob = ufo_cities.groupby(["year","city","country","freq"]).size()
# sort in descending order
high_prob = high_prob.sort_values(ascending=False)
# get the top shape values
high_prob = high_prob.head(15)
# reset the index to count to get a dataframe
high_prob = high_prob.reset_index(name="count")
# make fig and ax objects and set figure size
fig, ax = plt.subplots(figsize=(8,5))
# get x and y values from high_prob dataframe
x = high_prob["city"]
y = high_prob["count"]
# make a bar chart with the x and y values; for x values use np.arange() to create an array of values
bar_container = ax.bar(np.arange(15), y)
# convert bar_container object into a BarContainer object to use bar_container.patches
bar_container = BarContainer(bar_container)
# set the labels and title
ax.set(xlabel="city", ylabel="count", title="Top 15 Cities by Year & Freq's of UFO Sightings")
# subset high_prob["year","percent_prob","count"] into a list
yr_count = high_prob[["year","count","freq"]]
# make an empty list to store the labels
label_list = list()
# iterate over the rows of the yr_count dataframe
for index, row in yr_count.iterrows():
# assign the values of the year, percent_prob, and count columns from each row
year = int(row["year"])
year = str(year)
cnt = str(row["count"])
freq = str(round(row["freq"], 3))
# combine the year and count values into a single string
label = year + ": " + freq + "%\nCount: " + cnt
# append this string to the label_list
label_list.append(label)
# format the bar_label object with bar_container.patches and label_list, slicing it to 15 items
ax.bar_label(bar_container.patches, labels=label_list[:15], fontsize=9, rotation=90, padding=6)
# set the xticks and xticklabels
# use np.arange() to create even-spaced xtick marks
ax.set_xticks(np.arange(15))
# use the cities in high_prob["city"] as the xticklabels
ax.set_xticklabels(high_prob["city"], rotation=60)
# set the y-axis limit
plt.ylim(0, 140)
plt.tight_layout()
plt.savefig("plots/top_us_cities_freq_barchart.png")
plt.show()
# copy ufo_reset_df to ufo_cities
ufo_cities = ufo_reset_df.copy()
# subset the ufo_cities with columns needed
ufo_cities = ufo_cities[["city","state","country","year","freq"]]
# filter out USA from ufo_cities dataframe
ufo_cities = ufo_cities.loc[ufo_cities["country"] != "USA"]
# get the top prob values
high_prob =ufo_cities.groupby(["year","city","country","freq"]).size()
# sort in descending order
high_prob = high_prob.sort_values(ascending=False)
# get the top shape values
high_prob = high_prob.head(15)
# reset the index to count to get a dataframe
high_prob = high_prob.reset_index(name="count")
# add 1 to the index so list starts at 1
high_prob.index = high_prob.index + 1
print("Top Non-USA Cities with Highest Frequencies of UFO Sightings based on year, city, country")
print(high_prob)
Top Non-USA Cities with Highest Frequencies of UFO Sightings based on year, city, country
year city country freq count
1 2022 Gatineau Canada 0.027922 39
2 2003 Terrace Canada 0.017899 25
3 2003 Houston Canada 0.016467 23
4 2003 Vancouver Canada 0.015035 21
5 2004 Toronto Canada 0.014319 20
6 2005 Toronto Canada 0.012171 17
7 2013 Toronto Canada 0.010739 15
8 2014 Toronto Canada 0.010739 15
9 2001 Vancouver Canada 0.010023 14
10 2004 Edmonton Canada 0.010023 14
11 2007 Toronto Canada 0.010023 14
12 2020 Toronto Canada 0.009307 13
13 2013 Ottawa Canada 0.009307 13
14 2015 Edmonton Canada 0.009307 13
15 2015 Toronto Canada 0.009307 13
# make fig and ax objects and set figure size
fig, ax = plt.subplots(figsize=(8,5))
# get x and y values from high_prob dataframe
x = high_prob["city"]
y = high_prob["count"]
# make a bar chart with the x and y values; for x values use np.arange() to create an array of values
bar_container = ax.bar(np.arange(15), y)
# convert bar_container object into a BarContainer object to use bar_container.patches
bar_container = BarContainer(bar_container)
# set the labels and title
ax.set(xlabel="city", ylabel="count", title="Top 15 Cities (Non-USA) by Year, Probabilities & Freq's of UFO Sightings")
# subset high_prob["year","percent_prob","count"] into a list
yr_count = high_prob[["year","freq","count"]]
# make an empty list to store the labels
label_list = list()
# iterate over the rows of the yr_count dataframe
for index, row in yr_count.iterrows():
# assign the values of the year, percent_prob, and count columns from each row
year = int(row["year"])
year = str(year)
cnt = str(row["count"])
freq = str(round(row["freq"], 2))
# combine the year and count values into a single string
label = year + ": " + freq + "%\nCount: " + cnt
# append this string to the label_list
label_list.append(label)
# format the bar_label object with bar_container.patches and label_list, slicing it to 15 items
ax.bar_label(bar_container.patches, labels=label_list[:15], fontsize=9, rotation=90, padding=6)
# set the xticks and xticklabels
# use np.arange() to create even-spaced xtick marks
ax.set_xticks(np.arange(15))
# use the cities in high_prob["city"] as the xticklabels
ax.set_xticklabels(high_prob["city"], rotation=60)
# set the y-axis limit
plt.ylim(0, 55)
plt.savefig("plots/top_nonusa_cities_freq_barchart.png")
plt.show()
# make a map for highest frequencies and probabilities in a year, city and country
# copy preds_cities to high_prob
high_prob = preds_cities.copy()
# get the top prob values
high_prob = high_prob.groupby(["year","city","state","country","city_latitude","city_longitude","actual","rf_pred"]).size()
# drop missing values
high_prob = high_prob.dropna()
# sort in descending order
high_prob = high_prob.sort_values(ascending=False)
# get the top freq values
high_prob = high_prob.head(100)
# reset the index to count to get a dataframe
high_prob = high_prob.reset_index(name="count")
print("Cities with Highest Actual vs Predicted Frequencies & Probabilities of UFO Sightings")
# create a folium map object with Wichita's readings as the location view
map = folium.Map(location=[38, -97], zoom_start=4)
# iterate through each city to set up a popup and tooltip showing appropriate actual & pred. values
for i, row in high_prob.iterrows():
folium.Marker(
location=[row["city_latitude"], row["city_longitude"]],
popup=f"{row['city']}, {row['state']} ({row['year']})<br>actual_freq={round(row['actual'],2)}%<br>pred_freq={round(row['rf_pred'],2)}%",
tooltip=f"{row['city']} {row['year']}<br>{round(row['actual'],2)}%"
).add_to(map)
# save and display the map
map.save("maps/top_cities_freq_map.html")
map
Cities with Highest Actual vs Predicted Frequencies & Probabilities of UFO Sightings
# US Cities Map
# copy preds_cities to ufo_cities
ufo_cities = preds_cities.copy()
# filter out other countries but the USA from ufo_int
ufo_cities = ufo_cities.loc[ufo_cities["country"] == "USA"]
# subset the ufo_cities with columns needed
ufo_cities = ufo_cities[["year","city","state","country","city_latitude","city_longitude","actual", "rf_pred"]]
# reduce observations by limiting the years > 2020
ufo_cities = ufo_cities.loc[ufo_cities["year"] > 2020]
# drop missing values
ufo_cities = ufo_cities.dropna()
# make a cities_list to display in map
cities_list = ["Haiku","Honolulu","Los Angeles","San Francisco","Reno","Las Vegas","Sparks","Eugene",
"Salem","Portland,""Seattle","Lakewood","Auburn","Tucson","Boulder","Denver",
"Colorado Springs","Provo","Phoenix","Roswell","Albuquerque","Billings","Bismarck",
"Fargo","Pierre","Bozeman","Missoula","Casper","Douglas",
"Des Moines","Cedar Rapids","Sioux City","Cheyenne","Chicago","Tinley Park","Omaha",
"Sioux Falls","Detroit","Louisville","Columbus","Cleveland","Baltimore","Pittsburgh","St. Paul",
"Fairbanks","Anchorage","Grand Rapids","Pittsburgh","Philadelphia","Boise",
"Salt Lake City","Austin","Houston","San Angelo","Oklahoma","Wichita","Nashville",
"Memphis","New York","Brooklyn","Boston","Providence","Hartford","Myrtle Beach",
"Washington","Richmond","Charleston","Raleigh","Mobile","Atlanta","New Orleans",
"Augusta","Jacksonville","Miami","Tampa","Orlando"]
# filter the dataframe by city is in cities_list
ufo_cities = ufo_cities.loc[ufo_cities["city"].isin(cities_list)]
# print the map title
print("US Cities with Actual vs Predicted Frequencies of UFO Sightings")
# create a folium map object with Wichita as the location view
map = folium.Map(location=[38, -97], zoom_start=4)
# iterate through each city to set up a popup and tooltip showing appropriate actual & pred. values
for i, row in ufo_cities.iterrows():
folium.Marker(
location=[row["city_latitude"], row["city_longitude"]],
popup=f"{row['city']}, {row['state']} ({row['year']})<br>actual_freq={round(row['actual'],2)}%<br>pred_freq={round(row['rf_pred'],2)}%",
tooltip=f"{row['city']} {row['year']}<br>{round(row['actual'],2)}%"
).add_to(map)
# save and display the map
map.save("maps/us_cities_map.html")
map
US Cities with Actual vs Predicted Frequencies of UFO Sightings
# find latitude and longitude values for select cities
# PLEASE NOTE: I removed the API key for running this function so another API key needed to run this function
# assign_lat_lon() takes a dataframe, a city string, and a country string as arguments
# and assigns geocode values to the city_latitude and city_longitude columns
def assign_lat_lon(df, city, country):
# use the geocoder library to get the latitude and longitude values for the city
result = geocoder.osm(f"{city}, {country}")
if result.ok:
# make a boolean mask that is True for rows where city column == city argument and country column == country argument
mask = (df["city"] == city) & (df["country"] == country)
# assign the values to the dataframe using the loc method and the mask as the index
df.loc[mask, "city_latitude"] = result.lat
df.loc[mask, "city_longitude"] = result.lng
else:
# get latitude and longitude using GoogleV3's geolocater
geolocator = GoogleV3(api_key="--removed--")
# Use the geocode method to get the coordinates of an address
result2 = geolocator.geocode(f"{city}, {country}")
if result2:
# make a boolean mask that is True for rows where city column == city argument and country column == country argument
mask = (df["city"] == city) & (df["country"] == country)
# assign the values to the dataframe using the loc method and the mask as the index
df.loc[mask, "city_latitude"] = result2.latitude
df.loc[mask, "city_longitude"] = result2.longitude
return df
# copy preds_cities dataframe
ufo_cities = preds_cities.copy()
# subset the ufo_cities with columns needed
ufo_cities = ufo_cities[["city","state","country","city_latitude","city_longitude","year","actual","rf_pred"]]
# fill in the latitude and longitude values of rows missing city_latitude and ciy_longitude values
missing_df = ufo_cities.isna()
# use the any() method to check if there are any missing values in the city_latitude and city_longitude columns
missing_lat_lon = missing_df["city_latitude"].any() or missing_df["city_longitude"].any()
# if there are missing values, call assign_lat_lon() with city and country arguments;
# cities from the Americas and Europe are hard-coded in
if missing_lat_lon:
ufo_cities = assign_lat_lon(ufo_cities, "Gatineau","Canada")
ufo_cities = assign_lat_lon(ufo_cities, "Ontario","Canada")
ufo_cities = assign_lat_lon(ufo_cities, "Montreal","Canada")
ufo_cities = assign_lat_lon(ufo_cities, "Coquitlam","Canada")
ufo_cities = assign_lat_lon(ufo_cities, "Lingan","Canada")
ufo_cities = assign_lat_lon(ufo_cities, "Sudbury","Canada")
ufo_cities = assign_lat_lon(ufo_cities, "San Juan","Puerto Rico")
ufo_cities = assign_lat_lon(ufo_cities, "Sandy Point","St. Kitts")
ufo_cities = assign_lat_lon(ufo_cities, "San Sebastián","Puerto Rico")
ufo_cities = assign_lat_lon(ufo_cities, "Cayo Guillermo","Cuba")
ufo_cities = assign_lat_lon(ufo_cities, "Havana","Cuba")
ufo_cities = assign_lat_lon(ufo_cities, "San Jose","Costa Rica")
ufo_cities = assign_lat_lon(ufo_cities, "Farallon","Panama")
ufo_cities = assign_lat_lon(ufo_cities, "Panama","Panama")
ufo_cities = assign_lat_lon(ufo_cities, "Oaxaca City","Mexico")
ufo_cities = assign_lat_lon(ufo_cities, "Mexico City","Mexico")
ufo_cities = assign_lat_lon(ufo_cities, "Puerto Penasco","Mexico")
ufo_cities = assign_lat_lon(ufo_cities, "Cancun","Mexico")
ufo_cities = assign_lat_lon(ufo_cities, "Tijuana","Mexico")
ufo_cities = assign_lat_lon(ufo_cities, "Puerto Vallarta","Mexico")
ufo_cities = assign_lat_lon(ufo_cities, "Guatemala City","Guatemala")
ufo_cities = assign_lat_lon(ufo_cities, "Quito","Ecuador")
ufo_cities = assign_lat_lon(ufo_cities, "Managua","Nicaragua")
ufo_cities = assign_lat_lon(ufo_cities, "Lima","Peru")
ufo_cities = assign_lat_lon(ufo_cities, "Bogotá","Colombia")
ufo_cities = assign_lat_lon(ufo_cities, "Medellin","Colombia")
ufo_cities = assign_lat_lon(ufo_cities, "Caracas","Venezuela")
ufo_cities = assign_lat_lon(ufo_cities, "Buenos Aires","Argentina")
ufo_cities = assign_lat_lon(ufo_cities, "Rio de Janeiro","Brazil")
ufo_cities = assign_lat_lon(ufo_cities, "SÃO PAULO","Brazil")
ufo_cities = assign_lat_lon(ufo_cities, "Juazeiro Do Norte","Brazil")
ufo_cities = assign_lat_lon(ufo_cities, "Achumani","Bolivia")
ufo_cities = assign_lat_lon(ufo_cities, "Santiago","Chile")
ufo_cities = assign_lat_lon(ufo_cities, "Ceuse","France")
ufo_cities = assign_lat_lon(ufo_cities, "Paris","France")
ufo_cities = assign_lat_lon(ufo_cities, "Rome","Italy")
ufo_cities = assign_lat_lon(ufo_cities, "Florence","Italy")
ufo_cities = assign_lat_lon(ufo_cities, "Barcelona","Spain")
ufo_cities = assign_lat_lon(ufo_cities, "Málaga","Spain")
ufo_cities = assign_lat_lon(ufo_cities, "Lisbon","Portugal")
ufo_cities = assign_lat_lon(ufo_cities, "Venice","Italy")
ufo_cities = assign_lat_lon(ufo_cities, "Berlin","Germany")
ufo_cities = assign_lat_lon(ufo_cities, "Munich","Germany")
ufo_cities = assign_lat_lon(ufo_cities, "Warsaw","Poland")
ufo_cities = assign_lat_lon(ufo_cities, "Krakow","Poland")
ufo_cities = assign_lat_lon(ufo_cities, "Zurich","Switzerland")
ufo_cities = assign_lat_lon(ufo_cities, "Prague","Czech Republic")
ufo_cities = assign_lat_lon(ufo_cities, "Zagreb","Croatia")
ufo_cities = assign_lat_lon(ufo_cities, "Bucharest","Romania")
ufo_cities = assign_lat_lon(ufo_cities, "Helsinki","Finland")
ufo_cities = assign_lat_lon(ufo_cities, "Oslo","Norway")
ufo_cities = assign_lat_lon(ufo_cities, "Stockholm","Sweden")
ufo_cities = assign_lat_lon(ufo_cities, "Næstved","Denmark")
ufo_cities = assign_lat_lon(ufo_cities, "Reykjavík","Iceland")
ufo_cities = assign_lat_lon(ufo_cities, "Amsterdam","Netherlands")
ufo_cities = assign_lat_lon(ufo_cities, "The Hague","Netherlands")
ufo_cities = assign_lat_lon(ufo_cities, "Kiev","Ukraine")
ufo_cities = assign_lat_lon(ufo_cities, "Kyiv","Ukraine")
ufo_cities = assign_lat_lon(ufo_cities, "Sydney","Australia")
ufo_cities = assign_lat_lon(ufo_cities, "Brisbane","Australia")
ufo_cities = assign_lat_lon(ufo_cities, "Melbourne","Australia")
ufo_cities = assign_lat_lon(ufo_cities, "London","United Kingdom")
ufo_cities = assign_lat_lon(ufo_cities, "Essex","United Kingdom")
ufo_cities = assign_lat_lon(ufo_cities, "Manchester","United Kingdom")
ufo_cities = assign_lat_lon(ufo_cities, "Liverpool","United Kingdom")
ufo_cities = assign_lat_lon(ufo_cities, "Dublin","United Kingdom")
ufo_cities = assign_lat_lon(ufo_cities, "Alloa","United Kingdom")
ufo_cities = assign_lat_lon(ufo_cities, "Athens","Greece")
ufo_cities = assign_lat_lon(ufo_cities, "Kolymbia Beach","Greece")
ufo_cities = assign_lat_lon(ufo_cities, "Istanbul","Turkey")
# if there are missing values, call assign_lat_lon() with city and country arguments;
# cities from Asia, Africa, and Middle East
if missing_lat_lon:
ufo_cities = assign_lat_lon(ufo_cities, "Meerut","India")
ufo_cities = assign_lat_lon(ufo_cities, "Delhi","India")
ufo_cities = assign_lat_lon(ufo_cities, "Port Blair","India")
ufo_cities = assign_lat_lon(ufo_cities, "Bangalore","India")
ufo_cities = assign_lat_lon(ufo_cities, "Senji","India")
ufo_cities = assign_lat_lon(ufo_cities, "Lahore","Pakistan")
ufo_cities = assign_lat_lon(ufo_cities, "Kabul","Afghanistan")
ufo_cities = assign_lat_lon(ufo_cities, "Tokyo","Japan")
ufo_cities = assign_lat_lon(ufo_cities, "Ueda","Japan")
ufo_cities = assign_lat_lon(ufo_cities, "Yokohama","Japan")
ufo_cities = assign_lat_lon(ufo_cities, "SANYA","China")
ufo_cities = assign_lat_lon(ufo_cities, "Changchun","China")
ufo_cities = assign_lat_lon(ufo_cities, "Shanghai","China")
ufo_cities = assign_lat_lon(ufo_cities, "Beijing","China")
ufo_cities = assign_lat_lon(ufo_cities, "Hong Kong","China")
ufo_cities = assign_lat_lon(ufo_cities, "Nantou","Taiwan")
ufo_cities = assign_lat_lon(ufo_cities, "Taipei","Taiwan")
ufo_cities = assign_lat_lon(ufo_cities, "Singapore","Singapore")
ufo_cities = assign_lat_lon(ufo_cities, "Kuala Lumpur","Malaysia")
ufo_cities = assign_lat_lon(ufo_cities, "Koh Samui","Thailand")
ufo_cities = assign_lat_lon(ufo_cities, "Bangkok","Thailand")
ufo_cities = assign_lat_lon(ufo_cities, "Ho Chi Minh City","Vietnam")
ufo_cities = assign_lat_lon(ufo_cities, "Seoul","South Korea")
ufo_cities = assign_lat_lon(ufo_cities, "Manila","Philippines")
ufo_cities = assign_lat_lon(ufo_cities, "Angeles City","Philippines")
ufo_cities = assign_lat_lon(ufo_cities, "Cebu","Philippines")
ufo_cities = assign_lat_lon(ufo_cities, "Jakarta","Indonesia")
ufo_cities = assign_lat_lon(ufo_cities, "Auckland","New Zealand")
ufo_cities = assign_lat_lon(ufo_cities, "Wellington","New Zealand")
ufo_cities = assign_lat_lon(ufo_cities, "Smolensk","Russia")
ufo_cities = assign_lat_lon(ufo_cities, "Khabarovsk","Russia")
ufo_cities = assign_lat_lon(ufo_cities, "Tikhoretsk","Russia")
ufo_cities = assign_lat_lon(ufo_cities, "Moscow","Russia")
ufo_cities = assign_lat_lon(ufo_cities, "St. Petersburg","Russia")
ufo_cities = assign_lat_lon(ufo_cities, "Nairobi","Kenya")
ufo_cities = assign_lat_lon(ufo_cities, "Cairo","Egypt")
ufo_cities = assign_lat_lon(ufo_cities, "Johannesburg","South Africa")
ufo_cities = assign_lat_lon(ufo_cities, "Cape Town","South Africa")
ufo_cities = assign_lat_lon(ufo_cities, "Durban","South Africa")
ufo_cities = assign_lat_lon(ufo_cities, "Harare","Zimbabwe")
ufo_cities = assign_lat_lon(ufo_cities, "Casablanca","Morocco")
ufo_cities = assign_lat_lon(ufo_cities, "Addis Ababa","Ethiopia")
ufo_cities = assign_lat_lon(ufo_cities, "Lagos","Nigeria")
ufo_cities = assign_lat_lon(ufo_cities, "Algiers","Algeria")
ufo_cities = assign_lat_lon(ufo_cities, "Tripoli","Libya")
ufo_cities = assign_lat_lon(ufo_cities, "Tehran","Iran")
ufo_cities = assign_lat_lon(ufo_cities, "Baghdad","Iraq")
ufo_cities = assign_lat_lon(ufo_cities, "Haifa","Israel")
ufo_cities = assign_lat_lon(ufo_cities, "Jerusalem","Israel")
ufo_cities = assign_lat_lon(ufo_cities, "Rekhasim","Israel")
ufo_cities = assign_lat_lon(ufo_cities, "Tel-Aviv","Israel")
ufo_cities = assign_lat_lon(ufo_cities, "Mecca","Saudi Arabia")
ufo_cities = assign_lat_lon(ufo_cities, "Wadi Dhahal","Syria")
ufo_cities = assign_lat_lon(ufo_cities, "Amman","Jordan")
ufo_cities = assign_lat_lon(ufo_cities, "Muscat","Oman")
ufo_cities = assign_lat_lon(ufo_cities, "Dubai","UAE")
# International Cities Map (all non-usa cities)
# filter out the USA from ufo_cities
ufo_cities = ufo_cities.loc[ufo_cities["country"] != "USA"]
ufo_cities = ufo_cities.loc[ufo_cities["country"] != "Unknown"]
# subset the ufo_cities with columns needed
ufo_cities = ufo_cities[["year","city","state","country","city_latitude","city_longitude","actual","rf_pred"]]
# reduce observations by limiting the years > 2020
ufo_cities = ufo_cities.loc[ufo_cities["year"] > 2020]
# drop missing values
ufo_cities = ufo_cities.dropna()
# make a cities_list to display in map
cities_list = ["Gatineau","Ontario","Ottawa","Toronto","Montreal","Coquitlam","Essex","Edmonton",
"Calgary","Thamesford","New Westminster","Kingston","Ajax","Dartmouth","Sherwood Park",
"Brandon","Orangeville","Nanaimo","Riverview","Sudbury","Vancouver",
"Oaxaca City","Mexico City","San Jose","San Juan","Guayama","Sandy Point",
"San Sebastián","Panama","Farallon","Cayo Guillermo","Havana","Bogotá","Medellin",
"Puerto Vallarta","Santiago","Achumani","Puerto Penasco","Cancun","Tijuana",
"Guatemala City","Quito","Managua","Lima","Buenos Aires","Rio de Janeiro","Barcelona",
"Ceuse","Paris","Florence","Rome","Venice","Prague","The Hague","Bucharest",
"Helsinki","Oslo","Lisbon","Málaga""Caracas","Munich","Berlin","Warsaw","Zurich",
"Stockholm","Næstved","Reykjavík","Amsterdam","Kiev","Kyiv","Sydney","Melbourne",
"Brisbane","London","Manchester","Liverpool","Dublin","Alloa","Athens","Port Blair",
"Kolymbia Beach","Istanbul","Meerut","Delhi","Senji","Bangalore","Kabul","Lahore",
"Tokyo","Ueda","Yokohama","SANYA","Changchun","Shanghai","Beijing","Hong Kong",
"Singapore","Taipei","Nantou","Kuala Lumpur","Ho Chi Minh City","Seoul","Bangkok","Koh Samui",
"Manila","Angeles City","Cebu","Tikhoretsk","Jakarta","Smolensk","Khabarovsk",
"Moscow","St. Petersburg","Nairobi","Cairo","Johannesburg","Harare","Wellington",
"Auckland","Cape Town","Durban","Casablanca","Addis Ababa","Lagos","Algiers",
"Tehran","Baghdad","Haifa","Jerusalem","Tel-Aviv","Mecca","Wadi Dhahal","Amman",
"Muscat","Dubai","Tripoli"]
# print the map title
print("Map of Cities (Non_USA) with Actual vs Predicted Frequencies & Probabilities of UFO Sightings")
# create a folium map object (location Juba, Sudan)
map = folium.Map(location=[5, 31], zoom_start=2)
# add a marker for each city with a popup showing its name and percent probabilty
for i, row in ufo_cities.iterrows():
folium.Marker(
location=[row["city_latitude"], row["city_longitude"]],
popup=f"{row['city']}, {row['country']} ({row['year']})<br>actual_freq={round(row['actual'],3)}%<br>pred_freq={round(row['rf_pred'],3)}%",
tooltip=f"{row['city']} {row['year']}<br>{round(row['actual'],3)}%"
).add_to(map)
# save and display the map
map.save("maps/world_cities_map.html")
map
Map of Cities (Non_USA) with Actual vs Predicted Frequencies & Probabilities of UFO Sightings